ABSTRACT

Learning novel and interesting concepts and relations from relational
databases is an important problem with many applications in database
systems and machine learning. Relational learning algorithms generally
leverage the properties of the database schema to find the definition of
the target concept in terms of the existing relations in the database.
Nevertheless, it is well established that the same data set may be
represented under different schemas for various reasons, such as
efficiency, data quality, and usability. Unfortunately, the output of
many current learning algorithms tends to vary quite substantially with
the choice of schema, both in terms of learning accuracy and efficiency,
which complicates their off-the-shelf application. In this paper, we
formalize the property of schema independence of relational learning
algorithms, and study both the theoretical and empirical dependence of
existing algorithms on the common class of vertical (de)composition
schema transformations.

We study both sample-based learning algorithms, which learn from sets of
labeled examples, and query-based algorithms, which learn by asking
queries to a user. For sample-based algorithms we consider the two main
algorithm classes: top-down and bottom-up. We prove that practical
top-down algorithms are generally not schema independent, while, in
contrast, two bottom-up algorithms, Golem and ProGolem, become schema
independent with some modifications.

For query-based learning algorithms we show that the vertical 
(de)composition transformations influence their learning efficiency. 

We support the theoretical results with an empirical study that
demonstrates the schema dependence/independence of several algorithms on
existing benchmark data sets under natural vertical (de)compositions.

1. INTRODUCTION 

Over the last decade, users' information needs over relational databases
expanded from seeking exact answers to precise queries to discovering
and learning interesting and novel relations and concepts
[21, 14, 19, 18, 10, 20, 26]. In recent years, the database community
has proposed multiple algorithms and systems that leverage database
approaches and techniques to make discovering novel patterns from
databases easier [26, 48, 21, 4, 20, 9]. Given a database and training
instances of a new target relation, relational
Original Schema                 Alternative Schema

student(stud)                   student(stud, phase, years)
inPhase(stud, phase)            professor(prof, position)
yearsInProgram(stud, years)     publication(title, person)
professor(prof)
hasPosition(prof, position)
publication(title, person)

Table 1: Fragments of some schemas for UW-CSE database. Primary key
attributes are underlined.


learning algorithms attempt to induce general (approximate) definitions
of the target in terms of existing relations [30, 38, 41, 48]. For
example, given a database with student and professor information, the
goal may be to induce a Datalog definition of a missing relation
advisedBy(stud,prof) based on a training set of known student-advisor
pairs. Since the space of possible definitions (e.g., all Datalog
programs) is enormous, learning algorithms employ heuristics to search
for effective definitions, which generally depend on the schema of the
database. More generally, statistical relational learning algorithms
[19] use the same heuristic mechanisms for structure learning, which in
turn renders them schema dependent.

As an example, Table 1 shows some relations from two schemas for the
UW-CSE¹ database over students and professors, which is used as a common
relational learning benchmark [41, 19]. The first and original schema
was designed by relational learning experts and is generally discouraged
in the database community, as it delivers poor usability and performance
in query processing without providing any advantages in terms of data
quality in return [2, 39, 13]. A database designer may use a schema
closer to the alternative in Table 1. We used the classic learning
algorithm FOIL [38] to induce a definition of advisedBy(stud,prof) for
each of the two schemas, resulting in two very different definitions.
The original schema yielded a more accurate definition based on
co-authorship information, while the alternative schema led to a
definition based on information about TA and teaching assignments.

Generally, there is no canonical schema for a particular set of content
in practice, and people often represent the same information in
different schemas [16, 39, 24]. For example, we have also observed that
researchers have used different schemas to represent the Mutagenesis
database, which is another well-known benchmark in the field of
relational learning [44, 27]. People may choose to represent their data
in one schema or another for several reasons. For example, it is
generally easier to enforce integrity constraints over highly normalized
schemas [2, 14]. On the other hand, because more normalized schemas
usually contain many relations, they are hard to understand and maintain
[39]. It also takes a relatively long time to answer queries over
database instances with such schemas [2, 39]. Thus, a database designer
may sacrifice data quality and choose a more denormalized schema for her
data to achieve better usability and/or performance. She may also hit a
middle ground by choosing one style of design for some relations and
another style for other relations in the schema. Further, as the
relative priorities of these objectives change over time, the schema
will also evolve.

1 http://alchemy.cs.washington.edu/data/uw-cse

Thus, users generally have to restructure their databases to some proper
schema in order to use relational learning algorithms effectively, i.e.,
to obtain definitions for the target concepts that a domain expert would
judge as correct and relevant. To make matters worse, these algorithms
do not normally offer any clear description of their desired schema, and
database users have to rely on their own expertise and/or trial and
error to find such schemas. Nevertheless, we ideally want our database
analytics algorithms to be used by ordinary users, not just experts who
know the internals of these algorithms. Further, the structure of
large-scale databases constantly evolves, and we want to move away from
the need for constant expert attention to keep learning algorithms
effective.

For similar reasons, current relational learning algorithms will not be
well suited for Big Data, which is inherently heterogeneous and evolving
as it obtains its content from many different data sources [45, 1].
Moreover, researchers often use (statistical) relational learning
algorithms to solve various important core database problems, such as
query processing, schema mapping, entity resolution, and information
extraction [3, 47, 10, 19]. Thus, the issue of schema dependence appears
in other areas of database management.

One approach to solving the problem of schema dependence is to run a
learning algorithm over all possible schemas for a validation subset of
the data and select the schema with the most accurate answers.
Nonetheless, computing all possible schemas of a database is generally
undecidable [16]. One may limit the search space to a particular family
of schemas to make their computation decidable. For instance, one may
choose to check only schemas that can be transformed via join and
project operations, i.e., vertical composition and decomposition
[2, 13]. However, the number of possible schemas within a particular
family for a data set is extremely large. For example, a relational
table may have an exponential number of distinct vertical
decompositions. As many relational learning algorithms need some time
for parameter tuning under a new schema [26], it may take a
prohibitively long time to find the best schema. Since many relational
learning algorithms need to access the content of the database, one has
to transform the underlying data to the desired schema, which may not be
practical for a large and/or constantly evolving database. Another
possible approach is to define a universal schema to which all possible
schemas can be transformed, and to use or develop algorithms that are
effective over this schema. The experience gained from the idea of the
universal relation [28] indicates that such schemas may not always exist
[2]. Users also have to transform their databases to the universal
schema, which may be quite complex and time-consuming considering the
intricacies associated with defining such a representation, especially
for large and/or constantly evolving databases.
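
To give a rough sense of this growth, the sketch below counts one
restricted family of vertical decompositions: those in which every
fragment keeps the key and the non-key attributes are partitioned among
the fragments. The function name and this restriction are assumptions
made for illustration only; the full space of decompositions considered
in the text is larger. Under this assumption the count is the Bell
number of the number of non-key attributes, which grows at least
exponentially.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_key_preserving_decompositions(n_nonkey: int) -> int:
    """Count partitions of n non-key attributes into non-empty fragments.

    Hypothetical helper: each fragment is assumed to carry the key plus
    one block of the partition, so the result is the Bell number B(n).
    """
    if n_nonkey == 0:
        return 1
    # Bell recurrence: B(n) = sum_{k=0}^{n-1} C(n-1, k) * B(k)
    return sum(
        comb(n_nonkey - 1, k) * count_key_preserving_decompositions(k)
        for k in range(n_nonkey)
    )


# student(stud, phase, years) has two non-key attributes: either keep
# them together or split them into two fragments.
two_attribute_count = count_key_preserving_decompositions(2)
ten_attribute_count = count_key_preserving_decompositions(10)
```

Even for a table with ten non-key attributes, this restricted family
already contains more than one hundred thousand candidate schemas.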

The inherent contradiction between generality and effectiveness has been
observed in machine learning and statistics [42]. If a learning
algorithm is tuned too thoroughly for its training datasets, it will be
very effective on those datasets, but may be much less effective
elsewhere. We argue that the same parallel can be drawn for algorithms
that use the structural details of schemas to learn novel concepts and
relationships. Developers design these heuristics according to their
observations of the structural properties of entities or relationships
over schemas that experts deem more natural. With limited time and
resources, it is not feasible to check if these heuristics capture the
desired properties for other schemas. Hence, relational learning
algorithms may also face the danger of over-fitting to schemas - akin to
a focus on syntax rather than deeper semantic measures.

In this paper, we introduce the novel property of schema independence
for relational learning algorithms, i.e., the ability to deliver the
same answers regardless of the choice of schema for the same data. We
propose a formal framework to measure the amount of schema independence
of a relational learning algorithm for a given family of
transformations. We also analyze and compare the schema independence of
popular relational learning algorithms to find the characteristics of
more schema independent heuristics. We also leverage concepts from the
database literature to design schema independent algorithms. To the best
of our knowledge, the property of schema independence has not been
introduced or explored for relational learning algorithms. Our
contributions:

• We introduce and formally define the property of schema independence
and explore its benefits to relational learning algorithms. We define
the property of schema independence for a relational learning
algorithm as its ability to return the same answer over
transformations that modify the database schema and preserve its
information content. We study schema independence for two types of
relational learning frameworks: 1) sample-based learning algorithms,
e.g., [30, 38, 41, 31], which learn the target relation using some
training data, and 2) query-based learning algorithms, which learn the
target relation by asking queries to some oracle, e.g., a database
user.

• We analyze the property of schema independence for a popular family
of sample-based learning algorithms, called top-down relational
learning algorithms [30, 38, 41]. We show that this family of
algorithms is not schema independent under vertical (de)composition
transformations.

• We explore the property of schema independence for another family of
widely used sample-based algorithms called bottom-up algorithms. We
formally analyze two typical algorithms from this family: Golem [32]
and ProGolem [33]. We prove that these algorithms are schema
dependent under vertical (de)composition. We extend these algorithms
and prove that our extensions are schema independent under vertical
(de)composition.

• We explore the schema independence of some typical query-based
relational learning algorithms [25, 43]. We prove that the number of
queries these algorithms need for successful learning depends on the
schema used to represent the data. Our result is strong because it
further shows that any reasonably good query-based algorithm will
require drastically more queries under some vertical (de)composition.
We also prove that the resource requirements of these algorithms can
grow exponentially under vertical (de)compositions.

• We empirically study the schema dependence of popular relational
learning algorithms under vertical (de)composition. Our empirical
results generally confirm our theoretical results and indicate that
transforming the schema considerably affects the effectiveness,
efficiency, and query complexities of well-known relational learning
algorithms. Because ProGolem is more efficient than Golem, we evaluate
the effectiveness of the extended ProGolem algorithm using a widely
used benchmark dataset. Our results show that the extended ProGolem
algorithm is as effective as the original version of ProGolem.

This paper is organized as follows. Section 2 describes the background.
Section 3 formally defines the property of schema independence for
sample-based relational learning algorithms. Sections 5 and 6 explore
the schema independence of top-down and bottom-up relational learning
algorithms, respectively. Section 7 defines schema independence for
query-based algorithms and explores their schema independence. Section 8
contains our empirical results and Section 9 concludes the paper. The
proofs for our theoretical results are in the appendix.

2. BACKGROUND 
2.1 Related Work 

The architects of the relational model have argued for logical data
independence, which, oversimplifying a bit, means that an exact query
should return the same answers no matter which logical schema is chosen
for the data. In this paper, we extend the principle of logical data
independence to relational learning algorithms. The property of schema
independence also differs from the idea of logical data independence in
a subtle but important respect. One may achieve logical data
independence with an affordable amount of expert intervention, such as
defining a set of views over the database. However, it generally takes
more time and much deeper expertise to find the proper schema for a
relational learning algorithm, particularly for database applications
that contain more than a single learning algorithm. Hence, it is less
likely that schema independence can be achieved via expert intervention.

Database researchers have applied techniques from query optimization to
create usable systems for tuning the parameters of unstructured learning
algorithms [26]. The schema of the database, however, is not a tuning
parameter of a learning algorithm. We also focus on structured learning
algorithms. Further, the authors in [26] try a reasonable subset of
possible values for learning parameters to find the desired settings for
the algorithm. Nevertheless, as explained in Section 1, this approach
cannot be successfully applied to address the issue of schema
dependence. Finding a subset of relevant features from the data is an
important step in deploying learning algorithms [4]. When applied in
this context, the idea of representation independence prefers features
that are more robust against representational variations in the
underlying database. Researchers have realized the need to transform,
i.e., wrangle, data as an important and widely used operation in data
preparation and have developed usable systems for data transformation
[24]. We address the same underlying problem but propose a very
different approach: making the data analytics algorithm independent of
representation. Researchers have analyzed the stability of some
(unstructured) learning algorithms against relatively small
perturbations in the data [34, 17, 7]. We also seek to instill
robustness in learning algorithms, but we are targeting robustness in a
new dimension: robustness in the face of variations in the schema of
data.

There is a large body of work on converting a database represented under
one schema to another one without modifying its information content
[22, 16, 29, 8]. We build on this work by exploring the sensitivity of
relational learning algorithms to such transformations. Researchers have
defined other types of schema transformations [8]. A notable group is
schema mappings in the context of data exchange, which are defined using
tuple generating dependencies between source and target schemas. This
group of transformations may modify the information content of and/or
introduce incomplete information to a database. Nevertheless, for the
property of schema independence, the original and transformed databases
should contain essentially the same information.

Researchers have defined the property of design independence for keyword
query processing over XML documents [46]. We extend this line of work by
introducing and formally exploring the property of schema independence
for relational learning algorithms. We focus on supervised learning
algorithms and their schema independence properties over the relational
data model.

2.2 Basic Definitions 

Let $\mathit{Attr}$ be a (countably) infinite set of symbols that
contains the names of attributes [2]. The domain of attribute $A$ is a
countably infinite set of values (i.e., constants or objects) that $A$
may contain. We assume that all attributes share a single domain
$\mathit{dom}$. A relation is a finite subset of $\mathit{Attr}$. We use
the terms predicate and relation interchangeably. A tuple over relation
$R$ is a total map from the set of attributes in $R$ to $\mathit{dom}$.
The relation instance $I_R$ of relation $R$ is a finite set of tuples. A
constraint restricts the properties of data stored in a database.
Examples of constraints are functional dependencies (FD) and inclusion
dependencies (IND). FD $A \to B$ in relation $R$, where
$A, B \subseteq R$, states that the values of attribute set $A$ uniquely
determine the values of attributes in $B$ in each tuple in every
relation instance $I_R$. An IND between attribute $C \in R$ and
$D \in S$, for relation $S$, denoted as $R[C] \subseteq S[D]$, states
that in all instances $I_R$ and $I_S$, values of attribute $C$ in any
tuple of $I_R$ must also appear in attribute $D$ of some tuple of $I_S$.
A schema is a pair $\mathcal{R} = (\mathbf{R}, \Sigma)$, where
$\mathbf{R}$ is a finite set of relations and $\Sigma$ is a finite set
of constraints. An instance of schema $\mathcal{R}$ is a mapping $I$
over $\mathcal{R}$ that associates each relation $R \in \mathcal{R}$ to
a relation instance $I_R$. An atom is a formula of the form
$R(u_1, \ldots, u_n)$, where $R$ is a relation, $n$ is the number of
attributes in $R$, and each $u_i$, $1 \le i \le n$, is a variable or
constant. A literal is an atom, or the negation of an atom. A definite
Horn clause (Horn clause or clause, for short) is a finite set of
literals that contains exactly one positive literal. The positive
literal is called the head of the clause, and the set of negative
literals is called the body. A clause has the form:

$T(\mathbf{u}) \leftarrow L_1(\mathbf{u}_1), \ldots, L_n(\mathbf{u}_n).$

An ordered clause is a clause where the order and duplication of
literals matter. A Horn expression is a set of Horn clauses. A Horn
definition is a Horn expression with the same predicate in the heads of
all clauses. A Horn definition is defined over a schema if the bodies of
all clauses in the definition contain only literals whose predicates are
relations in the schema. In this work, we will use Horn definitions to
define new target relations that are not in the current schema. Thus,
the heads of all clauses in such definitions will be the target
relation. The literal associated with the target relation does not
contain any constant.
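
The two kinds of constraints defined above can be checked mechanically
on a given instance. The following is a minimal sketch, assuming that a
relation instance is stored as a list of attribute-to-value
dictionaries; the function names and the toy rows are hypothetical and
not taken from the UW-CSE data.

```python
from typing import Dict, List, Sequence

Row = Dict[str, str]  # a tuple as a total map from attributes to values


def satisfies_fd(instance: List[Row], lhs: Sequence[str],
                 rhs: Sequence[str]) -> bool:
    """Return True iff the instance satisfies the FD lhs -> rhs."""
    seen = {}
    for row in instance:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[b] for b in rhs)
        # Two rows agreeing on lhs must also agree on rhs.
        if seen.setdefault(key, value) != value:
            return False
    return True


def satisfies_ind(inst_r: List[Row], attr_c: str,
                  inst_s: List[Row], attr_d: str) -> bool:
    """Return True iff every value of R[C] also appears in S[D]."""
    return {r[attr_c] for r in inst_r} <= {s[attr_d] for s in inst_s}


# Toy instances (hypothetical values).
student = [{"stud": "s1"}, {"stud": "s2"}]
in_phase = [{"stud": "s1", "phase": "pre_quals"},
            {"stud": "s2", "phase": "post_quals"}]
bad_in_phase = in_phase + [{"stud": "s1", "phase": "post_quals"}]

fd_holds = satisfies_fd(in_phase, ["stud"], ["phase"])
fd_violated = not satisfies_fd(bad_in_phase, ["stud"], ["phase"])
ind_holds = satisfies_ind(student, "stud", in_phase, "stud")
```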

3. FRAMEWORK 
3.1 Relational Learning 

Relational learning can be viewed as a search problem for a hypothesis
that deduces the training data, following either a top-down or bottom-up
approach. Top-down algorithms [38, 30] start from the most general
hypothesis and employ specialization operators to get more specific
hypotheses. A common specialization operator is the addition of a new
literal to the body of a clause. On the other hand, bottom-up algorithms
[32, 33, 31] start from specific hypotheses that are constructed based
on ground training examples, and use generalization operators to search
the hypothesis space. Generalization operators include inverse
resolution, relative least general generalization, and asymmetric
relative minimal generalization, among others. Therefore, a generic
relational learning algorithm can be seen as a sequence of steps, where
in each step an operator is applied to the current hypothesis.
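
As an illustration of the specialization step mentioned above, the
following sketch enumerates the clauses obtained by adding one literal
to a clause body. It is a deliberately simplified operator, not the
refinement operator of FOIL or of any specific system: the clause
encoding, the function name, and the restriction to existing variables
plus one fresh variable are all assumptions made for this example.

```python
from itertools import product
from typing import Dict, List, Tuple

Literal = Tuple[str, Tuple[str, ...]]   # (predicate, arguments)
Clause = Tuple[Literal, List[Literal]]  # (head, body)


def specialize(clause: Clause, schema: Dict[str, int]) -> List[Clause]:
    """Return every clause obtained by adding one new body literal.

    Arguments of the new literal are drawn from the variables already in
    the clause plus a single fresh variable (an illustrative restriction).
    """
    head, body = clause
    variables = sorted({v for _, args in [head, *body] for v in args})
    candidates = variables + ["_New"]
    refinements = []
    for predicate, arity in schema.items():
        for args in product(candidates, repeat=arity):
            literal = (predicate, args)
            if literal not in body:
                refinements.append((head, body + [literal]))
    return refinements


# Most general clause for the target relation: an empty body.
most_general: Clause = (("advisedBy", ("X", "Y")), [])

# Student-related relations of the two schemas in Table 1, as name -> arity.
original_schema = {"student": 1, "inPhase": 2, "yearsInProgram": 2}
alternative_schema = {"student": 3}

n_original = len(specialize(most_general, original_schema))
n_alternative = len(specialize(most_general, alternative_schema))
```

Even this toy operator produces a different set of candidate clauses for
each schema, which hints at why a search guided by such candidates can
behave differently when the schema changes.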

Inductive Logic Programming (ILP) is the subfield of machine learning
that performs relational learning by learning first-order definitions
from examples and an input relational database. In this paper we use the
names ILP algorithm and relational learning algorithm interchangeably. A
relational learning algorithm takes as input training data $E$,
background knowledge $B$, and target relation $T$, and learns a
hypothesis $H$ that, together with $B$, entails $E$. $E$ usually
contains ground unit clauses of a single target predicate $T$, which
express positive ($E^+$) or negative ($E^-$) examples. $H$ is usually
restricted to Horn definitions for efficiency reasons. In this paper we
consider the case where $B$ is extensional background knowledge, which
consists of ground atoms and can be expressed in the form of a database
instance. More formally, the learning problem is described as follows:

DEFINITION 3.1. Given background knowledge $B$, positive examples $E^+$,
negative examples $E^-$, and a target relation $T$, the ILP task is to
find a definition $H$ for $T$ such that:

• $\forall p \in E^+,\; H \wedge B \models p$ (completeness)

• $\forall p \in E^-,\; H \wedge B \not\models p$ (consistency)

In the following sections we provide concrete definitions of several 
relational learning algorithms. 

EXAMPLE 3.2. Consider using a relational learning algorithm and the
UW-CSE database with the original schema shown in Table 1 to learn a
definition for the target relation collaborated(X, Y), which indicates
that person X has collaborated with person Y. The algorithm may return
the following definition:

collaborated(X, Y) ← publication(P, X), publication(P, Y).

This is a complete and consistent definition with respect to the
training data, and indicates that two persons have collaborated if they
have been co-authors.
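
The completeness and consistency conditions of Definition 3.1 can be
checked directly for a clause such as the one in Example 3.2 when the
background knowledge is extensional. The sketch below does this over a
small hand-made publication relation; the tuples, the example sets, and
the function name are hypothetical and are not drawn from the UW-CSE
data.

```python
# Extensional background knowledge: publication(title, person).
publication = {
    ("p1", "ann"), ("p1", "bob"),
    ("p2", "bob"), ("p2", "carl"),
}


def collaborated(pubs):
    """Evaluate collaborated(X, Y) <- publication(P, X), publication(P, Y)."""
    return {(x, y) for (p1, x) in pubs for (p2, y) in pubs if p1 == p2}


positives = {("ann", "bob"), ("bob", "carl")}   # E+
negatives = {("ann", "carl")}                   # E-

derived = collaborated(publication)

# Completeness: every positive example is derived.
is_complete = positives <= derived
# Consistency: no negative example is derived.
is_consistent = derived.isdisjoint(negatives)
```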

In this paper, we study relational learning algorithms for Horn
definitions. We denote the set of all Horn definitions over schema
$\mathcal{R}$ by $\mathcal{HD}_{\mathcal{R}}$. This set can be very big,
which means that algorithms would need a lot of resources (e.g., time
and space) to explore all definitions. However, in practice, resources
are limited. For this reason, algorithms accept parameters that either
restrict the hypothesis space or restrict the search strategy. For
instance, an algorithm may consider only clauses whose number of
literals is fewer than a given number, or may follow a greedy approach
where only one clause is considered at a time. Let the parameters for a
learning algorithm be a tuple of variables
$\theta = (\theta_1, \ldots, \theta_r)$, where each $\theta_i$ is a
parameter for the algorithm. We denote the parameter space by $\Theta$,
and it contains all possible parameters for an algorithm. We denote the
hypothesis space (or language) of algorithm $A$ over schema
$\mathcal{R}$ with parameters $\theta$ as
$\mathcal{L}^{\mathcal{R}}_{A,\theta}$. Note that not all parameters
affect the hypothesis space. For instance, a parameter setting the
search strategy to greedy impacts how the hypothesis space is explored,
but does not restrict the hypothesis space. The hypothesis space
$\mathcal{L}^{\mathcal{R}}_{A,\theta}$ is a subset of
$\mathcal{HD}_{\mathcal{R}}$ [30, 38, 19], and each member of
$\mathcal{L}^{\mathcal{R}}_{A,\theta}$ is a hypothesis.

Clearly, there is a trade-off between computational resources used 
by an algorithm and the size of its hypothesis space. The hypothe¬ 
sis space is restricted so that the algorithm can be used in practice, 
with the hope that the algorithm will find a consistent and complete 
hypothesis. 

EXAMPLE 3.3. Continuing Example 3.2, consider restricting the hypothesis
space to clauses whose number of literals is fewer than a given number,
which we call clause-length. Assume that we are now interested in
learning a definition for the target relation collaboratedProf(X, Y),
which indicates that professor X has collaborated with professor Y,
under the original schema. If we set clause-length = 5, the learning
algorithm is able to learn the complete and consistent definition

collaboratedProf(X, Y) ← professor(X), professor(Y),
                         publication(P, X), publication(P, Y).

However, if we set clause-length = 3, the previous definition is not in
the hypothesis space of the algorithm. Therefore, the algorithm is not
able to learn this definition or any other complete and consistent
definition.
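
The effect of the clause-length parameter in Example 3.3 can be sketched
as a simple membership test on the hypothesis space. The clause encoding
and the function name below are hypothetical. The sketch also assumes
that only body literals count toward clause-length, which is the reading
under which the four-literal body above is admitted for clause-length =
5; the text does not state this explicitly.

```python
from typing import List, Tuple

Literal = Tuple[str, Tuple[str, ...]]   # (predicate, arguments)
Clause = Tuple[Literal, List[Literal]]  # (head, body)


def in_hypothesis_space(clause: Clause, clause_length: int) -> bool:
    """Assumption: only body literals are counted against clause_length."""
    _head, body = clause
    return len(body) < clause_length


collaborated_prof: Clause = (
    ("collaboratedProf", ("X", "Y")),
    [
        ("professor", ("X",)),
        ("professor", ("Y",)),
        ("publication", ("P", "X")),
        ("publication", ("P", "Y")),
    ],
)

admitted_with_5 = in_hypothesis_space(collaborated_prof, 5)
admitted_with_3 = in_hypothesis_space(collaborated_prof, 3)
```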

3.2 Schema Independence of Relational Learning Algorithms

3.2.1 Mapping Database Instances 

One may view a schema as a way of representing background knowledge used
by relational learning algorithms to learn the definitions of target
relations. Intuitively, in order to learn essentially the same
definitions over schemas $\mathcal{R}$ and $\mathcal{S}$, we should make
sure that $\mathcal{R}$ and $\mathcal{S}$ represent basically the same
information. Let us denote the set of database instances of schema
$\mathcal{R}$ as $\mathcal{I}(\mathcal{R})$. In order to compare the
ability of $\mathcal{R}$ and $\mathcal{S}$ to represent the same
information, we would like to check whether for each database instance
$I \in \mathcal{I}(\mathcal{R})$ there is a database instance
$J \in \mathcal{I}(\mathcal{S})$ that contains basically the same
information as $I$. We adapt the notion of equivalence between schemas
in the database literature to precisely state this idea [22, 16].

Given schemas $\mathcal{R}$ and $\mathcal{S}$, a transformation is a
(computable) function
$\tau : \mathcal{I}(\mathcal{R}) \to \mathcal{I}(\mathcal{S})$. For
brevity, we write transformation $\tau$ as
$\tau : \mathcal{R} \to \mathcal{S}$. Transformation $\tau$ is
invertible iff it is total and there exists a transformation
$\tau^{-1} : \mathcal{S} \to \mathcal{R}$ such that the composition of
$\tau$ and $\tau^{-1}$ is the identity mapping on
$\mathcal{I}(\mathcal{R})$, that is, $\tau^{-1}(\tau(I)) = I$ for
$I \in \mathcal{I}(\mathcal{R})$. The transformation $\tau^{-1}$ may or
may not be total. We call $\tau^{-1}$ the inverse of $\tau$. If
transformation $\tau$ is invertible, one can convert every instance
$I \in \mathcal{I}(\mathcal{R})$ to an instance
$J \in \mathcal{I}(\mathcal{S})$ and reconstruct $I$ from the available
information in $J$. Schemas $\mathcal{R}$ and $\mathcal{S}$ are
information equivalent via transformation
$\tau : \mathcal{R} \to \mathcal{S}$ iff $\tau$ is bijective.
Informally, if two schemas are equivalent, one can convert the databases
represented using one of them to the other without losing any
information. Hence, one can reasonably argue that equivalent schemas
essentially represent the same information. Our definition of
information equivalence between two schemas is more restricted than the
ones proposed in [22, 16]. We assume that in order for schemas
$\mathcal{R}$ and $\mathcal{S}$ to be information equivalent via $\tau$,
$\tau^{-1}$ has to be total. Although more restricted, this definition
is sufficient to cover the transformations discussed in this paper.
Employing a framework that adheres to the ones used in [22, 16] is
future work.

EXAMPLE 3.4. In addition to the functional dependencies shown in
Table 1, let the following inclusion dependencies hold over the
relations of the original schema in this table:
$\text{student}[\text{stud}] \subseteq \text{inPhase}[\text{stud}]$,
$\text{inPhase}[\text{stud}] \subseteq \text{student}[\text{stud}]$,
$\text{yearsInProgram}[\text{stud}] \subseteq \text{student}[\text{stud}]$,
$\text{student}[\text{stud}] \subseteq \text{yearsInProgram}[\text{stud}]$,
$\text{professor}[\text{prof}] \subseteq \text{hasPosition}[\text{prof}]$,
$\text{hasPosition}[\text{prof}] \subseteq \text{professor}[\text{prof}]$.
One may join relations student, inPhase, and yearsInProgram and join
relations professor and hasPosition to map each instance of the original
schema to an instance of the alternative schema. Further, each instance
of the alternative schema can be mapped to an instance of the original
schema by projecting relation student to relations student, inPhase, and
yearsInProgram and projecting relation professor to relations
hasPosition and professor. Hence, these schemas are information
equivalent.
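
The round trip described in Example 3.4 can be sketched on the
student-related relations as follows. Relations are modeled as sets of
Python tuples, and the function names and toy values are hypothetical.
The sketch also shows why the inclusion dependencies matter: when one of
them is violated, the join drops a student and the original instance can
no longer be reconstructed.

```python
def compose(student, in_phase, years_in_program):
    """Join student, inPhase, and yearsInProgram on stud."""
    return {
        (s, ph, yr)
        for (s,) in student
        for (s2, ph) in in_phase if s2 == s
        for (s3, yr) in years_in_program if s3 == s
    }


def decompose(student_wide):
    """Project student(stud, phase, years) back to the three relations."""
    student = {(s,) for (s, _, _) in student_wide}
    in_phase = {(s, ph) for (s, ph, _) in student_wide}
    years_in_program = {(s, yr) for (s, _, yr) in student_wide}
    return student, in_phase, years_in_program


# Instance that satisfies the inclusion dependencies of Example 3.4.
original = (
    {("s1",), ("s2",)},
    {("s1", "pre_quals"), ("s2", "post_quals")},
    {("s1", "2"), ("s2", "5")},
)
round_trip_ok = decompose(compose(*original)) == original

# Instance that violates student[stud] ⊆ inPhase[stud]: s2 has no phase.
violating = (
    {("s1",), ("s2",)},
    {("s1", "pre_quals")},
    {("s1", "2"), ("s2", "5")},
)
round_trip_lossy = decompose(compose(*violating)) != violating
```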

3.2.2 Mapping Definitions 

Let $\mathcal{HD}_{\mathcal{R}}$ be the set of all Horn definitions over
schema $\mathcal{R}$. In order to learn semantically equivalent
definitions over schemas $\mathcal{R}$ and $\mathcal{S}$, we should make
sure that the sets $\mathcal{HD}_{\mathcal{R}}$ and
$\mathcal{HD}_{\mathcal{S}}$ are equivalent. That is, for every
definition $h_{\mathcal{R}} \in \mathcal{HD}_{\mathcal{R}}$, there is a
semantically equivalent Horn definition in $\mathcal{HD}_{\mathcal{S}}$,
and vice versa. If the set of Horn definitions over $\mathcal{R}$ is a
superset or subset of the set of Horn definitions over $\mathcal{S}$, it
is not reasonable to expect a learning algorithm to learn semantically
equivalent definitions over $\mathcal{R}$ and $\mathcal{S}$.

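Let $\mathcal{L}_{\mathcal{R}}$ be a set of Horn definitions over schema
$\mathcal{R}$ such that
$\mathcal{L}_{\mathcal{R}} \subseteq \mathcal{HD}_{\mathcal{R}}$. Let
$h_{\mathcal{R}} \in \mathcal{L}_{\mathcal{R}}$ be a Horn definition
over schema $\mathcal{R}$ and $I \in \mathcal{I}(\mathcal{R})$ be a
database instance. The result of applying a Horn definition
$h_{\mathcal{R}}$ to database instance $I$ is the set containing the
heads of all instantiations of $h_{\mathcal{R}}$ for which the body of
the instantiation is contained in $I$. We denote the result of
$h_{\mathcal{R}}$ over $I$ by $h_{\mathcal{R}}(I)$.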

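DEFINITION 3.5. Transformation $\tau : \mathcal{R} \to \mathcal{S}$ is
definition preserving w.r.t. $\mathcal{L}_{\mathcal{R}}$ and
$\mathcal{L}_{\mathcal{S}}$ iff there exists a total function
$\delta_\tau : \mathcal{L}_{\mathcal{R}} \to \mathcal{L}_{\mathcal{S}}$
such that for every definition
$h_{\mathcal{R}} \in \mathcal{L}_{\mathcal{R}}$ and
$I \in \mathcal{I}(\mathcal{R})$,
$h_{\mathcal{R}}(I) = \delta_\tau(h_{\mathcal{R}})(\tau(I))$.

Intuitively, Horn definitions $h_{\mathcal{R}}$ and
$\delta_\tau(h_{\mathcal{R}})$ deliver the same results over all
corresponding database instances in $\mathcal{R}$ and $\mathcal{S}$. We
call function $\delta_\tau$ a definition mapping for $\tau$.
Transformation $\tau$ is definition bijective w.r.t.
$\mathcal{L}_{\mathcal{R}}$ and $\mathcal{L}_{\mathcal{S}}$ iff $\tau$
and $\tau^{-1}$ are definition preserving w.r.t.
$\mathcal{L}_{\mathcal{R}}$ and $\mathcal{L}_{\mathcal{S}}$.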

If $\tau$ is definition bijective w.r.t. equivalent sets of Horn
definitions, one can rewrite each Horn definition over $\mathcal{R}$ as
a Horn definition over $\mathcal{S}$ such that they return the same
results over all corresponding database instances of $\mathcal{R}$ and
$\mathcal{S}$, and vice versa. We call these definitions equivalent. We
use the operator $\equiv$ to show that two definitions are equivalent.

3.2.3 Relationship Between Information Equivalence 
and Definition Bijective Transformations 

In order for a learning algorithm to learn equivalent definitions over
schemas $\mathcal{R}$ and $\mathcal{S}$, where
$\tau : \mathcal{R} \to \mathcal{S}$, it is reasonable for $\mathcal{R}$
and $\mathcal{S}$ to be information equivalent via $\tau$ and for $\tau$
to be definition bijective w.r.t. $\mathcal{HD}_{\mathcal{R}}$ and
$\mathcal{HD}_{\mathcal{S}}$. Information equivalence guarantees that
the learning algorithm takes as input the same background knowledge. A
definition bijective transformation ensures that the learning algorithm
can output equivalent Horn definitions over both schemas. Nevertheless,
it may be hard to check both conditions for given schemas. Next, we
extend the results in [16] to find the relationship between the
properties of information equivalence and definition bijective
transformations.

In this paper, we consider only transformations that can be written as
sets of Horn definitions. We call these Horn transformations. Vertical
composition and decomposition are examples of Horn transformations.

EXAMPLE 3.6. The transformation from the original schema to the
alternative schema in Table 1 can be written as the following set of
Horn definitions:

student(X, Y, Z) ← student(X), inPhase(X, Y), yearsInProgram(X, Z).
professor(X, Y) ← professor(X), hasPosition(X, Y).
publication(X, Y) ← publication(X, Y).

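Assume that transformation $\tau : \mathcal{R} \to \mathcal{S}$ and its
inverse $\tau^{-1} : \mathcal{S} \to \mathcal{R}$ are Horn
transformations. Clearly, the head of each Horn definition in
$\tau^{-1}$ will be a relation in $\mathcal{R}$. Let $h_{\mathcal{R}}$
be a Horn definition in $\mathcal{HD}_{\mathcal{R}}$. The composition of
$h_{\mathcal{R}}$ and $\tau^{-1}$, denoted by
$h_{\mathcal{R}} \circ \tau^{-1}$, is a Horn definition that belongs to
$\mathcal{HD}_{\mathcal{S}}$, created by applying $h_{\mathcal{R}}$ to
the head predicates of clauses in $\tau^{-1}$. That is,
$(h_{\mathcal{R}} \circ \tau^{-1})(J) = h_{\mathcal{R}}(\tau^{-1}(J))$,
for all $J \in \mathcal{I}(\mathcal{S})$. We prove the following
proposition, similar to Theorem 3.2 in [16].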

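PROPOSITION 3.7. Given schemas $\mathcal{R}$ and $\mathcal{S}$, if transformation
$\tau : \mathcal{R} \to \mathcal{S}$ is a Horn transformation and it is invertible, then $\tau$ is
definition preserving w.r.t. $\mathcal{HD}_{\mathcal{R}}$ and $\mathcal{HD}_{\mathcal{S}}$.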

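PROOF. Suppose that transformation $\tau : \mathcal{R} \to \mathcal{S}$ is a Horn
transformation and it is invertible. We define a function
$\delta_{\tau} : \mathcal{HD}_{\mathcal{R}} \to \mathcal{HD}_{\mathcal{S}}$ to be $\delta_{\tau}(h_{\mathcal{R}}) = h_{\mathcal{R}} \circ \tau^{-1}$ for any
$h_{\mathcal{R}} \in \mathcal{HD}_{\mathcal{R}}$. We know that $\delta_{\tau}(h_{\mathcal{R}}) \in \mathcal{HD}_{\mathcal{S}}$. Furthermore, for any
$h_{\mathcal{R}} \in \mathcal{HD}_{\mathcal{R}}$ and $I \in \mathcal{I}(\mathcal{R})$,
$h_{\mathcal{R}}(I) = h_{\mathcal{R}}(\tau^{-1}(\tau(I))) = (h_{\mathcal{R}} \circ \tau^{-1})(\tau(I)) = \delta_{\tau}(h_{\mathcal{R}})(\tau(I))$.
Thus, $\delta_{\tau}$ is a definition mapping for $\tau$ and $\tau$ is
definition preserving w.r.t. $\mathcal{HD}_{\mathcal{R}}$ and $\mathcal{HD}_{\mathcal{S}}$. □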

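Intuitively, if transformation $\tau : \mathcal{R} \to \mathcal{S}$ is an invertible Horn
transformation, then any Horn definition in $\mathcal{HD}_{\mathcal{R}}$ can be rewritten
as a Horn definition in $\mathcal{HD}_{\mathcal{S}}$ such that they return the same results
over equivalent database instances. In the proof of Proposition 3.7,
the definition mapping function that maps members of $\mathcal{HD}_{\mathcal{R}}$, such
as $h_{\mathcal{R}}$, to members of $\mathcal{HD}_{\mathcal{S}}$ is $h_{\mathcal{R}} \circ \tau^{-1}$. According to
Proposition 3.7, if schemas $\mathcal{R}$ and $\mathcal{S}$ are information equivalent via Horn
transformation $\tau$, for all $h_{\mathcal{R}} \in \mathcal{HD}_{\mathcal{R}}$, $h_{\mathcal{R}} \circ \tau^{-1} \in \mathcal{HD}_{\mathcal{S}}$, and for
all $h_{\mathcal{S}} \in \mathcal{HD}_{\mathcal{S}}$, $h_{\mathcal{S}} \circ \sigma^{-1} \in \mathcal{HD}_{\mathcal{R}}$, then $\tau$ is definition bijective
w.r.t. $\mathcal{HD}_{\mathcal{R}}$ and $\mathcal{HD}_{\mathcal{S}}$.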
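
As a rough illustration of the definition mapping used in the proof, the following Python sketch treats a database instance as a dictionary from relation names to sets of tuples. The instance, the definition `h_R`, and all function names are hypothetical choices made for this sketch; only the student composition from Example 3.6 is taken from the text.

```python
# Toy instance of the original (decomposed) schema R: relation name -> set of tuples.
I = {
    "student": {("jake",), ("sara",)},
    "inPhase": {("jake", "PreQuals"), ("sara", "PostQuals")},
    "yearsInProgram": {("jake", 2), ("sara", 5)},
}

def tau(inst):
    """Composition R -> S: student(X,Y,Z) <- student(X), inPhase(X,Y), yearsInProgram(X,Z)."""
    return {"student": {
        (x, y, z)
        for (x,) in inst["student"]
        for (x1, y) in inst["inPhase"] if x1 == x
        for (x2, z) in inst["yearsInProgram"] if x2 == x
    }}

def tau_inv(inst):
    """Decomposition S -> R, written with projections of the composed relation."""
    s = inst["student"]
    return {
        "student": {(x,) for (x, _, _) in s},
        "inPhase": {(x, y) for (x, y, _) in s},
        "yearsInProgram": {(x, z) for (x, _, z) in s},
    }

def h_R(inst):
    """Hypothetical Horn definition over R: postQuals(X) <- student(X), inPhase(X, 'PostQuals')."""
    return {(x,) for (x,) in inst["student"] if (x, "PostQuals") in inst["inPhase"]}

def delta_tau(h):
    """Definition mapping: delta_tau(h) = h o tau^{-1}, a definition over S."""
    return lambda J: h(tau_inv(J))

J = tau(I)
h_S = delta_tau(h_R)
```

On this instance, `h_R(I)` and `h_S(J)` return the same tuples, which is the property $h_{\mathcal{R}}(I) = \delta_{\tau}(h_{\mathcal{R}})(\tau(I))$ from the proof. The round trip `tau_inv(tau(I)) == I` relies on every student having an `inPhase` and a `yearsInProgram` tuple.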

Example 3.8. Let $\mathcal{R}$ be the original schema and $\mathcal{S}$ be the
alternative schema in Example |j.7| Let $\tau : \mathcal{R} \to \mathcal{S}$ be the join
operator, and $\tau^{-1} : \mathcal{S} \to \mathcal{R}$ be the projection operator, which
is the inverse of join. Because of Proposition 3.7, $\tau$ is definition
bijective w.r.t. $\mathcal{HD}_{\mathcal{R}}$ and $\mathcal{HD}_{\mathcal{S}}$.

In this paper, we consider only the Horn transformations that are
both invertible and definition bijective w.r.t. sets of Horn
definitions.

3.2.4 Schema Independence 

The hypothesis space determines the set of possible Horn definitions
that the algorithm can explore. Therefore, the output of a
learning algorithm depends on its hypothesis space. In Example
3.3, we showed that an algorithm is able to learn a definition for a
target relation with some hypothesis space but not in another more
restricted space. In order for an algorithm to learn semantically
equivalent definitions for a target relation over schemas $\mathcal{R}$ and $\mathcal{S}$,
it should have equivalent hypothesis spaces over $\mathcal{R}$ and $\mathcal{S}$. We call
this property hypothesis invariance. Let $\Theta$ be the parameter space
for algorithm $A$.

DEFINITION 3.9. Algorithm $A$ is hypothesis invariant under
transformation $\tau : \mathcal{R} \to \mathcal{S}$ iff $\tau$ is definition bijective w.r.t.
$\mathcal{L}^{A}_{\mathcal{R},\theta}$ and $\mathcal{L}^{A}_{\mathcal{S},\theta}$, for all $\theta \in \Theta$.

Let $\Gamma$ be a set of transformations. We say that algorithm $A$ is
hypothesis invariant under $\Gamma$ if it is hypothesis invariant under $\tau$, for
all $\tau \in \Gamma$.

We now define the notion of schema independence for relational
learning algorithms over a set of bijective transformations. We
define a relational learning algorithm as a function $A(I, E, \theta)$ to
$\mathcal{L}^{A}_{\mathcal{R},\theta}$. That is, taking as input a database instance $I$, training
examples $E$, and parameters $\theta \in \Theta$, the algorithm outputs a hypothesis
that belongs to $\mathcal{L}^{A}_{\mathcal{R},\theta}$.


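DEFINITION 3.10. Algorithm $A$ is schema independent under
bijective transformation $\tau : \mathcal{R} \to \mathcal{S}$ iff $\tau$ is definition bijective
w.r.t. $\mathcal{HD}_{\mathcal{R}}$ and $\mathcal{HD}_{\mathcal{S}}$ and for all $I \in \mathcal{I}(\mathcal{R})$ and $J \in \mathcal{I}(\mathcal{S})$, all
$\theta \in \Theta$, and target relation $T$, we have:

• $A(I, E, \theta) = \delta_{\tau}(A(\tau(I), E, \theta))$, where $\delta_{\tau}$ is the definition
mapping for $\tau$.

• $A(J, E, \theta) = \delta_{\tau^{-1}}(A(\tau^{-1}(J), E, \theta))$, where $\delta_{\tau^{-1}}$ is the
definition mapping for $\tau^{-1}$.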

Again, we say that algorithm $A$ is schema independent under the
set of transformations $\Gamma$ if it is schema independent under $\tau$, for
all $\tau \in \Gamma$. Note that if an algorithm is schema independent under
transformation $\tau$, then it is hypothesis invariant under $\tau$. In
other words, hypothesis invariance under a set of transformations
is a necessary condition for an algorithm to be schema independent
under the same set of transformations. Note that it is possible
for an algorithm to be hypothesis invariant without being schema
independent. In such cases, the cause of schema dependence must
necessarily be related to the search process of the algorithm, rather
than to its hypothesis representation capacity.

Example 3.11. Consider the original schema and the alternative
schema in Table 1. The original schema is the result of a vertical
decomposition of the alternative schema. Consider the learning
algorithm FOIL. If the target relation is collaboratedProf(X, Y), as
in Example 3.3, FOIL is able to learn equivalent definitions under
the original schema and the alternative schema. However, if the
target relation is advisedBy(X, Y), FOIL learns non-equivalent
definitions under these schemas. Under the original schema, it learns
a definition based on co-authorship:

advisedBy(X, Y) ← student(X), professor(Y),
                  publication(P, X), publication(P, Y).

On the other hand, under the alternative schema, FOIL learns a
definition based on courses taught by the professor in which the
student has been TA:

advisedBy(X, Y) ← course(C, Y, X, T, L).

Therefore, FOIL is not schema independent.

In this paper, we consider only generic transformations, which
treat data values as essentially uninterpreted objects [23]. Generic
transformations are usually allowed to use a bounded number of
constants. However, the transformations considered in this paper
use no constants. Considering this type of transformation is
reasonable, as relational learning algorithms also treat data values as
uninterpreted objects.

4. VERTICAL (DE)COMPOSITION 

There is a wide variety of information-preserving transformations
between relational schemas [22, 2]. It would take more space
than a single paper to explore the behavior of relational learning
algorithms over all such transformations. In this paper, we explore
the schema independence of relational learning algorithms
under vertical composition/decomposition transformations [2, 36].
We select this group of transformations because we have observed
several instances of these transformations in relational learning
research papers and systems. Section 1 presented one of these cases.
Further, they are widely used in relational databases, as a database
designer may decompose and/or compose relations to achieve
the desired trade-off between efficiency, degree of normalization
and data quality, and schema readability.


Vertical composition and decomposition may be done in the presence
of functional and/or join dependencies in the schema [2]. Due
to the limited space, we focus only on vertical composition and
decomposition that involve functional dependencies. Since our
analysis in this paper mainly leverages the structure of the transformed
schema rather than the properties of its dependencies, we believe
that our results may extend to vertical compositions and
decompositions that involve join dependencies. A more careful analysis
of this case is a subject for future work.

Following [36], we define vertical decomposition as follows. Let
$FD_{\mathcal{R}}$ denote the closure of the FDs in schema $\mathcal{R}$. We denote relation
$R$ as $R(A)$, where $A$ is the set of attributes in $R$. If both INDs
$R_1[A] \subseteq R_2[B]$ and $R_2[B] \subseteq R_1[A]$ are in $\mathcal{R}$, we denote them
as $R_1[A] = R_2[B]$ for brevity. We call such an IND an IND with
equality.

DEFINITION 4.1. A vertical decomposition (decomposition for
short) of schema $\mathcal{R}$ with single relation $R(A)$ is a schema $\mathcal{S}$ with
relations $S_1(B_1), \ldots, S_n(B_n)$, $n \ge 1$, such that

• $A = \bigcup_{1 \le i \le n} S_i(B_i)$.

• $FD_{\mathcal{R}} = FD_{\mathcal{S}}$.

• AZ3 fll <i< n Si(Bi) 0. 

® Fll Si(Bi). 

• Let $C$ be $\bigcap_{1 \le i \le n} S_i(B_i)$; the inclusion dependencies
$S_i[C] = S_j[C]$, $1 \le i, j \le n$, are in schema $\mathcal{S}$.

If $n = 1$ in Definition 4.1, then schema $\mathcal{R}$ will remain unchanged.
As we will see below, the last condition in Definition 4.1
is needed to ensure that $\mathcal{S}$ does not contain more information than
$\mathcal{R}$ [2, 22, 36]. We define the vertical decomposition of a schema with
more than one relation as the set of vertical decompositions of all
its relations. Table 1 depicts an example of decomposition. Relation
student in the alternative schema is decomposed into relations
student, inPhase, and yearsInProgram in the original
schema. One may also rename the attributes after decomposing the
schema. Our results hold if such renaming is applied after
a decomposition. According to Corollary 4.3.2 in [36], every
decomposition is bijective.
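
The student decomposition above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: relations are sets of tuples, attributes are addressed by position, the first attribute plays the role of the common attribute set $C$, and the data values and function names are hypothetical.

```python
from functools import reduce

def project(rel, cols):
    """Projection of a relation (set of tuples) onto attribute positions `cols`."""
    return {tuple(t[c] for c in cols) for t in rel}

def decompose(rel, parts):
    """Vertical decomposition: one projection per attribute set B_i."""
    return [project(rel, cols) for cols in parts]

def compose(rels):
    """Natural join of relations that all share their first attribute (the set C)."""
    def join(a, b):
        return {ta + tb[1:] for ta in a for tb in b if ta[0] == tb[0]}
    return reduce(join, rels)

def ind_with_equality(r1, cols1, r2, cols2):
    """True iff the IND with equality R1[cols1] = R2[cols2] holds on these instances."""
    return project(r1, cols1) == project(r2, cols2)

# student(stud, phase, years) in the alternative schema (hypothetical tuples).
student = {("jake", "PreQuals", 2), ("sara", "PostQuals", 5)}
parts = [[0], [0, 1], [0, 2]]  # positions of B_1, B_2, B_3; attribute 0 is shared
s_student, s_in_phase, s_years = decompose(student, parts)
round_trip = compose([s_student, s_in_phase, s_years])
```

Here the three projections satisfy the INDs with equality on the shared attribute, and joining them recovers the original relation, which is the behavior the bijectivity claim describes. The round trip depends on the shared attribute functionally determining the remaining attributes, as in the example.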

A vertical composition (composition for short) is the inverse of a
decomposition. Because decomposition is bijective, composition is
also bijective. We define a composition/decomposition of a schema
as a finite set of applications of composition or decomposition to
the schema. Hence, a composition/decomposition may decompose
some relations in the schema, compose some relations, and leave
some intact. This transformation reflects the modifications one may
apply to a schema: compose some relations to improve performance,
decompose some to achieve quality and/or readability, and
leave some unchanged. Because both composition and decomposition
are bijective, composition/decomposition is bijective. Because
composition/decompositions are expressed by projection and natural
join [2], they are Horn transformations and generic. Hence, they
are definition bijective.

5. TOP-DOWN ALGORITHMS 

Top-down relational learning algorithms follow a covering approach
[38, 30]. An algorithm that uses a covering approach constructs
one clause at a time. After building a clause, the algorithm
adds the clause to the hypothesis, discards the positive examples
covered by the clause, and moves on to learn a new clause.
Algorithm 1 sketches a generic relational learning algorithm that follows
a covering approach. The strategy followed by the LearnClause
procedure depends on the nature of the algorithm. In top-down
algorithms, the LearnClause procedure in Algorithm 1 searches
the hypothesis space from general to specific, by using a refinement
(specialization) operator. The refinement operator can perform two
syntactic operations. The first operation is to substitute the variables
in the literals of the clause with fresh variables, other used
variables, or constants. The second operation is to add a new literal
to the clause.

collaborated(X, Y) ← true
    ← student(X)
    ← inPhase(X, P)
    ...
    ← publication(P, X)
        ← publication(P, X), publication(P, Y)

Figure 1: Fragments of a refinement graph for collaborated.
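
The literal-adding operation of a refinement operator can be sketched as below. This is a simplified illustration, not FOIL's actual operator: the schema dictionary, the variable-naming scheme, and the restriction to one fresh variable per step are assumptions of the sketch, and substitution by constants is omitted.

```python
from itertools import product

SCHEMA = {"student": 1, "inPhase": 2, "publication": 2}  # relation name -> arity

def refine(clause, schema=SCHEMA):
    """Yield clauses obtained by adding one literal to the body of `clause`."""
    head, body = clause
    used = sorted({v for _, args in (head, *body) for v in args})
    fresh = f"V{len(used)}"  # a single new variable is available per refinement step
    for pred, arity in schema.items():
        for args in product(used + [fresh], repeat=arity):
            if all(a == fresh for a in args):
                continue  # skip literals that share no variable with the clause
            literal = (pred, args)
            if literal not in body:
                yield head, body + (literal,)

# Root of the refinement graph: collaborated(X, Y) <- true.
root = (("collaborated", ("X", "Y")), ())
children = list(refine(root))

# One child, collaborated(X, Y) <- publication(V2, X), refined once more.
child = (root[0], (("publication", ("V2", "X")),))
grandchildren = list(refine(child))
```

Applying `refine` to the root and then to the `publication` child reaches the clause with body `publication(V2, X), publication(V2, Y)`, which corresponds to the path shown in Figure 1 up to variable naming.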


Algorithm 1: Generic relational learning algorithm following
a covering approach.

Input : Database instance $I$, positive examples $E^{+}$, negative
        examples $E^{-}$
Output: A set of Horn definitions $H$

$U \leftarrow E^{+}$;
while $U$ is not empty do
    $C \leftarrow$ LearnClause($I$, $E^{+}$, $E^{-}$);
    $H \leftarrow H \cup C$;
    $U \leftarrow U - \{c \in U \mid H \wedge I \models c\}$;
end
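
A minimal Python sketch of the covering loop is given below. It is an illustration under several assumptions that are not in Algorithm 1: clauses are drawn from a small hypothetical pool of candidates, LearnClause is replaced by a toy selection rule that receives the still-uncovered positives, and a guard stops the loop if a learned clause covers nothing new.

```python
def covering(instance, positives, negatives, learn_clause, covers):
    """Generic covering loop: learn one clause at a time until all positives are covered."""
    hypothesis, uncovered = [], set(positives)
    while uncovered:
        clause = learn_clause(instance, uncovered, negatives)
        newly_covered = {e for e in uncovered if covers(instance, clause, e)}
        if not newly_covered:  # guard added for this sketch; not part of Algorithm 1
            break
        hypothesis.append(clause)
        uncovered -= newly_covered
    return hypothesis

# Hypothetical candidate clauses, each encoded as a test on an example (X, Y).
CANDIDATES = {
    "coauthors": lambda I, e: any(
        (p, e[0]) in I["publication"] and (p, e[1]) in I["publication"]
        for p, _ in I["publication"]),
    "taOfCourse": lambda I, e: any(
        (c, e[0]) in I["taughtBy"] and (c, e[1]) in I["ta"]
        for c, _ in I["taughtBy"]),
}

def covers(instance, clause, example):
    return CANDIDATES[clause](instance, example)

def learn_clause(instance, uncovered, negatives):
    """Toy LearnClause: the consistent candidate covering the most uncovered positives."""
    consistent = [c for c in sorted(CANDIDATES)
                  if not any(covers(instance, c, n) for n in negatives)]
    return max(consistent, key=lambda c: sum(covers(instance, c, e) for e in uncovered))

I = {
    "publication": {("p1", "john"), ("p1", "jake")},
    "taughtBy": {("c1", "mary")},
    "ta": {("c1", "sara")},
}
positives = {("john", "jake"), ("mary", "sara")}
negatives = {("john", "sara")}
H = covering(I, positives, negatives, learn_clause, covers)
```

On this toy input the loop needs two clauses, one per positive example, since neither candidate covers both.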


The hypothesis space in top-down algorithms can be seen as a
refinement graph, that is, a rooted directed acyclic graph in which
nodes represent clauses and each arc is the application of a basic
refinement operator. The basic strategy of top-down algorithms
consists of starting from the most general clause, which corresponds
to the root of the refinement graph, and repeatedly refining it until
it does not cover any negative example. Figure 1 shows fragments
of the refinement graph for learning the definition of the collaborated
relation over the original schema of Table 1. Because of space
constraints, we do not show the head of the clause collaborated in any
node of the refinement graph in Figure 1 but its root.

The strategy of constructing and searching the refinement graph
varies between different top-down algorithms. For instance, FOIL
[38, 48, 35] is an efficient and popular top-down algorithm that follows
a greedy best-first search strategy. In this section, we analyze the
schema independence properties of FOIL. However, the results that
we show in this section hold for all top-down algorithms no matter
which search strategy they follow.

The refinement graph for most schemas, even the ones with a
relatively small number of relations and attributes, may grow
significantly [38, 30]. Hence, the construction and search over the
refinement graph may become too inefficient to be practical. To
be used in practice, FOIL restricts its search space, i.e., hypothesis
space. We call the number of literals in a clause its length. A
common method is to restrict the maximum length of each clause
in the refinement graph [38, 30]. Intuitively, because
composition/decompositions modify the number of relations in a schema,
equivalent clauses over the original and transformed schemas may
have different lengths. Hence, this type of restriction may result
in different hypothesis spaces. One may try to fix this problem
by choosing different values for the maximum lengths over the
original and transformed schemas. For example, one may pick a
smaller value to bound the clause lengths over the schemas with
a smaller number of relations. The following theorem proves that it is
not possible to achieve equivalent hypothesis spaces over the
original and transformed schemas by restricting the maximum length
of clauses, no matter what values are used over the original and
transformed schemas.

THEOREM 5.1. FOIL is not hypothesis invariant under vertical 
composition/decomposition. 

Now, we describe a method of restricting the hypothesis space for
FOIL that achieves hypothesis invariance. FOIL constructs its
hypothesis space by starting from an empty clause and gradually adding
a new relation or substituting the variables in the current clause.
Let FOIL learn a target relation over schema $\mathcal{S}$ with relations $S_i$,
$1 \le i \le n$. Given relation $S_i$ in $\mathcal{S}$, we call the set of all relations $S_j$
such that there is an IND $S_i[B] = S_j[C]$ in $\mathcal{S}$ the inclusion class
of $S_i$. We make the following modifications to FOIL. First, right
after adding a new relation $S_i$ to the current clause, FOIL adds all
relations in its inclusion class to the clause. Second, given that $S_i$ and
$S_j$ appear in the current clause and the IND $S_i[B] = S_j[C]$
is in $\mathcal{S}$, FOIL assigns the same variables to attributes $B$ and $C$.
Finally, we restrict the hypothesis space by limiting the maximum
number of inclusion classes in each clause.
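
To make the last restriction concrete, the following Python sketch counts inclusion classes rather than literals. It is an illustration only: the IND lists, the clause encoding, and the function names are assumptions of this sketch, and a relation is taken to belong to its own inclusion class.

```python
# INDs with equality of the original (decomposed) schema, as pairs of
# (relation, attribute position). The alternative schema has none here.
INDS_ORIGINAL = [
    (("student", 0), ("inPhase", 0)),
    (("student", 0), ("yearsInProgram", 0)),
    (("inPhase", 0), ("yearsInProgram", 0)),
    (("professor", 0), ("hasPosition", 0)),
]
INDS_ALTERNATIVE = []

def inclusion_class(rel, inds):
    """Relations linked to `rel` by an IND with equality, plus `rel` itself."""
    cls = {rel}
    for (r1, _), (r2, _) in inds:
        if r1 == rel:
            cls.add(r2)
        if r2 == rel:
            cls.add(r1)
    return frozenset(cls)

def num_inclusion_classes(body, inds):
    """Number of distinct inclusion classes among the relations in a clause body."""
    return len({inclusion_class(pred, inds) for pred, _ in body})

# Equivalent clause bodies over the two schemas.
body_original = [
    ("student", ("X",)), ("inPhase", ("X", "Y")),
    ("yearsInProgram", ("X", "Z")), ("publication", ("P", "X")),
]
body_alternative = [("student", ("X", "Y", "Z")), ("publication", ("P", "X"))]
```

The two bodies have lengths 4 and 2, so a bound on clause length treats them differently, while both contain exactly two inclusion classes. This is the intuition behind bounding the number of inclusion classes instead of the number of literals.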

PROPOSITION 5.2. The modified FOIL is hypothesis invariant 
under composition/decomposition. 

One problem with the modified FOIL, and other similarly modified
top-down algorithms for that matter, is that it has to evaluate
clauses with a rather large number of relations. Since most of these
clauses are already minimal, the algorithm may need to join a large
number of relations to evaluate each candidate clause. Hence,
learning may be very slow and not practical over relatively large
databases. Further, FOIL traverses the refinement graph, evaluates
a set of candidate clauses in the graph, and returns the most promising
clause. Hence, to be schema independent, the algorithm must
evaluate clauses in the same order over equivalent schemas. Let
$\tau : \mathcal{R} \to \mathcal{S}$ be a composition/decomposition. If FOIL generates
and evaluates clause $h_{\mathcal{R}}$ before clause $h'_{\mathcal{R}}$ over schema $\mathcal{R}$, it must
generate and evaluate the clause $\delta_{\tau}(h_{\mathcal{R}})$ before the clause $\delta_{\tau}(h'_{\mathcal{R}})$
over $\mathcal{S}$. One of the operations of (modified) FOIL for generating
new clauses is assigning variables to attributes in the current clause.
It is not clear how FOIL can assign variables to attributes such that
it maintains the same order of generating clauses over $\mathcal{R}$ and $\mathcal{S}$
without strong assumptions about the schema, such as the strong
universal relation and unique attribute role assumptions [2]. Hence,
the generate-and-test approach used in top-down algorithms like
FOIL is generally at odds with schema independence.

6. BOTTOM-UP ALGORITHMS 

In this section, we consider bottom-up algorithms. Unlike top-down
algorithms, bottom-up algorithms search the hypothesis space
from specific to general. They usually construct a specific hypothesis
taking as seed one training example, and then they apply
generalization operators on one or more of these hypotheses. More
specifically, given a positive example, bottom-up algorithms usually
try to find the most specific clause in the hypothesis space that
covers the example, relative to the background knowledge. This is
called the saturation or bottom clause. In this section, we analyze
and propose a modification to the algorithm for bottom clause
construction given by [30]. We then analyze two popular algorithms,
namely Golem [32] and ProGolem [33].






professor(John)   hasPosition(John, Associate)   publication(A, John)
professor(Mary)   hasPosition(Mary, Assistant)   publication(B, Mary)
student(Jake)     inPhase(Jake, PreQuals)        publication(A, Jake)
student(Sara)     inPhase(Sara, PostQuals)       publication(B, Sara)

Table 2: Sample database for UW-CSE Original Schema.

6.1 Bottom Clause Construction 

Let $\bot_e$ be the bottom clause associated with example $e$, relative
to the background knowledge, i.e., database instance, $B$. Then,
$\bot_e$ is the most specific clause such that $B \cup \bot_e \vdash_{0} e$, where $\vdash_{0}$
is a deductive inference operator (e.g., resolution). Therefore, the
bottom clause contains all information that is relevant to both the
example and the background knowledge. The bottom clause can
be computed by inverting the normal deductive proof process [42].
Because the resolution operator is complete, the bottom clause can
be computed by using the inverse resolution operator. However,
employing inverse resolution is highly inefficient. Inverse entailment
was proposed to overcome this issue [30]. An algorithm for
computing bottom clauses using inverse entailment is given in [30].
This algorithm assigns variables to constants. Then the literals
added to the bottom clause may contain variables and constants.
We call these free literals. Given a seed example, this algorithm
first adds the free literal corresponding to the example to the head
of the bottom clause. It then adds free literals to the body of the
clause in an iterative manner. The database instance is used to
determine which literals to add.

Example 6.1. Consider the database instance shown in Table 2
and the example collaborated(John, Jake). The algorithm
keeps a function that maps constants to variables. When it sees
an already used constant, it uses the previously assigned variable.
The algorithm first adds the free literal collaborated(V1, V2) to
the head of the bottom clause, where it assigned V1 to John and
V2 to Jake. Next, it adds literals to the body of the clause in an
iterative manner. For instance, in iteration 1 it may add all literals
where constants John and Jake appear. Then, the resulting
bottom clause after iteration 1 would be

collaborated(V1, V2) ← professor(V1), student(V2),
                       hasPosition(V1, V3), inPhase(V2, V4),
                       publication(V5, V1), publication(V5, V2).
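
The first iteration of this example can be sketched in Python as follows. This is a simplified illustration rather than the algorithm of [30]: the database encoding, the traversal order (relation by relation), and the publication identifiers for Mary and Sara are assumptions of the sketch.

```python
# Sample database in the spirit of Table 2 (relation -> list of tuples).
# The identifier "B" for the second publication is an illustrative assumption.
DB = {
    "professor": [("John",), ("Mary",)],
    "student": [("Jake",), ("Sara",)],
    "hasPosition": [("John", "Associate"), ("Mary", "Assistant")],
    "inPhase": [("Jake", "PreQuals"), ("Sara", "PostQuals")],
    "publication": [("A", "John"), ("B", "Mary"), ("A", "Jake"), ("B", "Sara")],
}

def bottom_clause_iteration_1(db, example):
    """Head literal plus every free literal whose tuple mentions a seed constant."""
    var_of = {}  # function mapping constants to variables

    def var(constant):
        if constant not in var_of:
            var_of[constant] = f"V{len(var_of) + 1}"
        return var_of[constant]

    pred, args = example
    head = (pred, tuple(var(c) for c in args))
    seeds = set(args)
    body = []
    for rel, tuples in db.items():
        for tup in tuples:
            if seeds & set(tup):  # the tuple mentions John or Jake
                body.append((rel, tuple(var(c) for c in tup)))
    return head, body

head, body = bottom_clause_iteration_1(DB, ("collaborated", ("John", "Jake")))
```

With relations visited in the order listed in `DB`, the sketch produces the same six body literals and the same variable names as the clause shown in the example.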

This algorithm can generate very long clauses after multiple
iterations. For a large database, a bottom clause may include hundreds
or thousands of literals. This would result in a long running time,
and therefore would not be useful in practice. Therefore, we should
avoid generating long bottom clauses. A common method is to
restrict the maximum depth of any term in the bottom clause [30].
The depth of a variable $X$, denoted by $d(X)$, is 0 if it appears in
the head of the clause; otherwise it is $\min_{V \in U_X}(d(V)) + 1$, where
$U_X$ are the variables in literals in the body of the clause containing
$X$. The depth of a literal is the maximum depth of the variables
appearing in the literal. The depth of a clause is the maximum depth
of the literals appearing in the clause. In the bottom clause
construction algorithm, in iteration $i$ only literals of depth at most $i$ are
added to the clause.

EXAMPLE 6.2. The following clause has depth 1:

collaborated(X, Y) ← publication(P, X), publication(P, Y).

Now consider a clause for target relation collabWithPerson(X, Y),
which indicates that X and Y have collaborated with the same person.

collabWithPerson(X, Y) ← publication(P1, X),              (1)
    publication(P1, Z), publication(P2, Z), publication(P2, Y).

This clause has depth 2.
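
The depth computation can be checked with a short Python sketch. It assumes clauses without constants, represented as a list of head variables and a list of body literals; the function names are hypothetical.

```python
def variable_depths(head_vars, body):
    """Depth of each variable: 0 in the head, else 1 + the minimum depth of a
    variable that shares a body literal with it (breadth-first from the head)."""
    depth = {v: 0 for v in head_vars}
    frontier, d = set(head_vars), 0
    while frontier:
        d += 1
        reached = set()
        for _, args in body:
            if frontier & set(args):
                for v in args:
                    if v not in depth:
                        depth[v] = d
                        reached.add(v)
        frontier = reached
    return depth

def clause_depth(head_vars, body):
    """Maximum depth over the variables reachable from the head."""
    return max(variable_depths(head_vars, body).values())

collaborated = [("publication", ("P", "X")), ("publication", ("P", "Y"))]
collab_with_person = [
    ("publication", ("P1", "X")), ("publication", ("P1", "Z")),
    ("publication", ("P2", "Z")), ("publication", ("P2", "Y")),
]
depth_1 = clause_depth(["X", "Y"], collaborated)
depth_2 = clause_depth(["X", "Y"], collab_with_person)
```

In the second clause, P1 and P2 each share a literal with a head variable and get depth 1, while Z only co-occurs with P1 and P2 and therefore gets depth 2.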

The algorithms that rely on bottom clauses can be heavily influenced
by the literals that appear in the bottom clauses. To achieve
schema independence, these algorithms should get equivalent bottom
clauses associated with the same example, relative to equivalent
instances of information equivalent schemas. If this is not
the case, then the algorithms would not be hypothesis invariant.
Unfortunately, using the depth parameter does not result in equivalent
bottom clauses associated with the same example, relative to
equivalent instances of information equivalent schemas. This is
because different schemas require different depth values for
equivalent clauses.

Example 6.3. Consider a new schema where we have a new
relation coauthor(title, person1, person2), which indicates that
person1 and person2 are coauthors in publication title. Then
we could have the following clause for the collabWithPerson
relation, which has depth 1:

collabWithPerson(X, Y) ←                                   (2)
    coauthor(P1, X, Z), coauthor(P2, Y, Z).

If we set the maximum depth parameter to 1, then under the Original
Schema, the clause shown in Example 6.2 would not be in the
hypothesis language, as it contains variables that have depth 2.
On the other hand, under the new schema, the clause presented
above would be in the hypothesis language. Therefore, an algorithm
that uses this bottom clause construction algorithm with
parameter depth would not be hypothesis invariant.

We propose the following modifications to the algorithm for
generating bottom clauses so that it produces equivalent bottom clauses
for the same example, relative to equivalent instances of information
equivalent schemas. The algorithm follows the normal procedure
of bottom clause construction, where at each iteration, it
selects a relation and adds one or more free literals of that relation
to the bottom clause. The algorithm keeps a function that maps
constants to variables, so that the same variable is assigned to the
same constant every time. The literals that are added to the clause
are based on the tuples for this relation in the database.

When adding literals, the algorithm applies a modified version
of the Chase algorithm [2] to the bottom clause with regard to the
available INDs with equality in the schema as follows. Assume that
the algorithm is generating the bottom clause relative to $I$, which
is an instance of schema $\mathcal{R}$. Assume that the algorithm selects
relation $R_i \in \mathcal{R}$ and adds a literal to the bottom clause based on
some tuple $t_i$ of relation $R_i$. Let $L$ be an inclusion class in $\mathcal{R}$
which contains $R_i$. For each constraint $R_i[A_i] = R_k[A_k]$ between
the members of $L$, the algorithm checks all tuples of relation $R_k$
that share join attributes with $t_i$. For each tuple, the algorithm
creates a free literal. If a constant in a tuple has already been seen,
then it uses the variable that was assigned to that constant. If the
constant has not been seen, it assigns a fresh new variable. If the
literal is not redundant with any literal already in the clause, then it
is added to the clause. Two literals are redundant if they have the
same relation name and the same old variables, and only differ in
fresh new variables. If the literal is redundant, then it is ignored.
The algorithm ensures that the corresponding attributes in $A_i$ and
$A_k$ are assigned the same variables. Because this version of the Chase
algorithm terminates and enforces the available INDs with equality
on the clause, the resulting clause is equivalent to the input clause
[2]. We call this procedure chase.

As explained above, the bottom clauses may get too long. Hence,
we also propose a modification of the algorithm so that the stopping
condition is based on an alternative parameter called maxvars.
This parameter indicates the maximum number of (distinct) variables
in a bottom clause before starting a new iteration of the
algorithm. That is, at the end of each iteration the algorithm checks how
many (distinct) variables are contained in the bottom clause. If this
number is less than the parameter maxvars, then the algorithm
continues to the next iteration. On the other hand, if the number of
variables is greater than or equal to maxvars, the algorithm stops.
Therefore, maxvars does not indicate the exact number of variables
in the clause. Instead, it is used as a stopping criterion in the
algorithm.
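
The stopping rule can be sketched as follows. This Python fragment is only an illustration of the maxvars check: it omits the chase step and the redundancy test, and the database, which differs from Table 2, and all names are hypothetical.

```python
# Hypothetical database (relation -> list of tuples), chosen so that a second
# iteration introduces two new variables at once.
DB = {
    "professor": [("John",), ("Mary",)],
    "student": [("Jake",), ("Sara",)],
    "publication": [("A", "John"), ("A", "Jake"), ("B", "Jake"),
                    ("B", "Mary"), ("B", "Sara")],
}
EXAMPLE = ("collaborated", ("John", "Jake"))

def bottom_clause(db, example, maxvars):
    """Iteratively add free literals; stop once the clause has >= maxvars variables."""
    var_of = {}  # function mapping constants to variables

    def var(constant):
        return var_of.setdefault(constant, f"V{len(var_of) + 1}")

    pred, args = example
    head = (pred, tuple(var(c) for c in args))
    body = []
    while True:
        known = set(var_of)  # constants seen before this iteration starts
        added = False
        for rel, tuples in db.items():
            for tup in tuples:
                if known & set(tup):
                    literal = (rel, tuple(var(c) for c in tup))
                    if literal not in body:
                        body.append(literal)
                        added = True
        # The check happens only at the end of an iteration.
        if not added or len(var_of) >= maxvars:
            return head, body

def num_vars(clause):
    head, body = clause
    return len({v for _, args in [head, *body] for v in args})

small = bottom_clause(DB, EXAMPLE, maxvars=4)
large = bottom_clause(DB, EXAMPLE, maxvars=5)
```

With `maxvars=5` the clause has four variables after the first iteration, so a second iteration runs and the final clause ends up with six variables. This matches the remark that maxvars acts as a stopping criterion and not as an exact bound.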

In the following lemma, we show that this algorithm delivers
equivalent bottom clauses associated with the same example, relative
to equivalent instances of information equivalent schemas. Let
$\tau : \mathcal{R} \to \mathcal{S}$ be a composition/decomposition. Let $I$ and $J$ be
instances of $\mathcal{R}$ and $\mathcal{S}$, respectively, such that $\tau(I) = J$.

Lemma 6.4. If $\bot_{e,I}$ and $\bot_{e,J}$ are bottom clauses associated
with $e$ relative to $I$ and $J$, respectively, generated by the algorithm
described above, then $\bot_{e,I} \equiv \bot_{e,J}$.

6.2 Golem 

In this section, we consider a bottom-up learning algorithm called
Golem [32]. Golem, like other learning algorithms, follows a covering
approach, as the one shown in Algorithm 1. Golem's LearnClause
procedure follows a bottom-up approach, which is based on the
relative least general generalization (rlgg) operator.

Given clauses $C_1$ and $C_2$, the least general generalization (lgg)
of $C_1$ and $C_2$ is the clause $C$ that is more general than $C_1$ and $C_2$,
but the least general such clause. The notion of generality is defined
by $\theta$-subsumption. Therefore, clause $C$ is more general than $C_1$ if
and only if $C$ $\theta$-subsumes $C_1$ (and similarly for $C_2$). This notion of
generality gives a computable generality relation. Further, the lgg
of two clauses is unique. Because of the lack of space, for further
details we refer the readers to [37, 32].

Consider a schema $\mathcal{R}$ and database instance $I \in \mathcal{I}(\mathcal{R})$. Let $e$ be
a positive example for some target relation $T$. A special case of
a bottom clause or saturation, used in Golem, is the one where all
literals in the body of the clause are grounded. Then, the saturation
of $e$ relative to $I$, denoted $\bot_{e,I}$, is the most specific clause $e \leftarrow I'$
such that $I' \subseteq I$ contains all ground atoms in $I$ that are somehow
related to $e$. By somehow related, we mean that the ground atoms
are linked to $e$ by some chain of ground atoms. The saturation
can be computed using the approach described previously, by only
allowing ground literals in the clause and restricting the maximum
number of constants instead of the maximum number of variables.
Therefore, the saturations for $e$ relative to $I$ and $J$ are equivalent
and have an "equivalent" order. The operator that computes the
lgg for a pair of saturations is called rlgg. The lgg of a set of
saturations is defined via pairwise operations, that is

$lgg(\{C_1, \ldots, C_n\}) = lgg(lgg(\{C_1, \ldots, C_{n-1}\}), C_n)$

The order of pairwise lggs does not matter, as the lgg operator is
commutative and associative.
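
For function-free clauses, the lgg of two saturations can be sketched in Python as below. This is a minimal illustration in the style of Plotkin's construction, not Golem's implementation; the clause encoding, the variable names, and the two ground saturations are assumptions of the sketch.

```python
def lgg(c1, c2):
    """lgg of two function-free clauses given as (head, body) with atoms (pred, args)."""
    table = {}  # each pair of differing terms is always replaced by the same variable

    def term(a, b):
        if a == b:
            return a
        return table.setdefault((a, b), f"V{len(table) + 1}")

    def atom(l1, l2):
        (p1, a1), (p2, a2) = l1, l2
        if p1 != p2 or len(a1) != len(a2):
            return None  # atoms of different relations have no lgg
        return (p1, tuple(term(x, y) for x, y in zip(a1, a2)))

    head = atom(c1[0], c2[0])
    body = []
    for l1 in c1[1]:          # every pair of compatible body literals
        for l2 in c2[1]:
            g = atom(l1, l2)
            if g is not None and g not in body:
                body.append(g)
    return head, body

# Two hypothetical ground saturations.
sat_1 = (("collaborated", ("John", "Jake")),
         [("professor", ("John",)), ("student", ("Jake",)),
          ("publication", ("A", "John")), ("publication", ("A", "Jake"))])
sat_2 = (("collaborated", ("Mary", "Sara")),
         [("professor", ("Mary",)), ("student", ("Sara",)),
          ("publication", ("B", "Mary")), ("publication", ("B", "Sara"))])
general = lgg(sat_1, sat_2)
```

The result keeps the co-authorship pattern `publication(V3, V1), publication(V3, V2)` but also contains literals such as `publication(V3, V4)` that come from mismatched pairs, which is why the body can be as large as the product of the two input body sizes.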

Given a database instance $I$ and training examples $E^{+}$ and $E^{-}$,
Golem's LearnClause procedure learns a clause that covers as
many positive examples as possible and no negative examples. The
procedure first computes the saturation for each example $e \in E^{+}$.
The set of all saturated clauses is $S = \{\bot_{e,I} : e \in E^{+}\}$. It then
tries to find a subset $S'$ of $S$ such that the clause $C = lgg(S')$ covers
many positive examples and no negative examples. Therefore,
the approach of Golem's LearnClause procedure is to find the largest
subset of saturated clauses $S'$ whose lgg covers no negative examples.
Algorithm 2 sketches this procedure. This algorithm uses the
function covers, defined as $covers(I, C, E) = \{e \in E \mid I \wedge C \models e\}$.
Intuitively, the algorithm first finds a set of pairs of examples to
generalize. These examples are picked greedily according to coverage.
It then greedily includes new examples into the generalization
as long as no negative examples are covered.


Algorithm 2: Golem's LearnClause procedure.

Input : Database instance $I$, positive examples $E^{+}$, negative
        examples $E^{-}$
Output: A clause $C^{*}$ that covers as many positive examples
        as possible and no negative examples.

$U \leftarrow E^{+}$;
$\mathcal{C} = \{C = lgg(\bot_{e,I}, \bot_{e',I}) \mid e, e' \in U, covers(I, C, E^{-}) = \emptyset\}$;
while $\mathcal{C}$ is not empty do
    $C^{*} = \arg\max_{C \in \mathcal{C}} |covers(I, C, U)|$;
    $C^{*} = reduce(C^{*})$;
    $U = U - covers(I, C^{*}, U)$;
    $\mathcal{C} = \{C = lgg(C^{*}, \bot_{e,I}) \mid e \in U, covers(I, C, E^{-}) = \emptyset\}$;
end
return $C^{*}$


Golem may generate definitions that contain very large clauses.
The reason is that the size of a clause generated by $lgg(C_1, C_2)$,
where $C_1$ and $C_2$ are clauses, is bounded by $|C_1| \cdot |C_2|$. Therefore,
the size of a clause resulting from pairwise lgg operations
can grow exponentially with the number of clauses from which it
is generalized. This results in exponential running time. For this
reason, Golem [32] employs some techniques to prune clauses in
the resulting hypothesis. Pruned clauses should be shorter than the
original clauses, but still express the desired hypothesis. In this paper,
we consider a pruning approach called negative reduction. This
technique consists of permanently removing literals of a clause if,
after their removal, the clause does not cover any negative example.
More formally, let $C' = reduce(C)$ be the negative reduction
operation such that if $C$ is a consistent clause and $C' \subseteq C$ is a pruned
clause, $C'$ is also consistent. Given that a good set of negative
examples is available, this technique is effective and efficient [32].

Lemma 6.5. The rlgg operator is schema independent w.r.t.
vertical composition/decomposition transformations.

THEOREM 6.6. Golem is schema independent w.r.t. vertical
composition/decomposition transformations.

6.3 ProGolem 

ProGolem is another learning algorithm that follows a bottom-up
approach [33]. Unlike Golem, it is based on the asymmetric
minimal general generalization (armg) operator, which is another
generalization operator. Like the previous algorithms, ProGolem follows
a covering approach similar to Algorithm 1. Given the bottom
clause associated with an example, ProGolem's LearnClause procedure
uses a greedy beam search to select the best clause generated
by the armg operator with respect to the bottom clause. The
armg operator uses other positive examples to generalize the input
bottom clause, so that the resulting clause covers these examples,
as well as the seed positive example. The clauses kept in the
beam correspond to different examples used for generalizing the
bottom clause. The beam search requires an evaluation function to
score clauses in the beam. We select an evaluation function that
is agnostic of the schema used, such as coverage, which is simply
the number of positive examples covered by the clause minus the
number of negative examples covered by the clause. We already
showed that we are able to get equivalent bottom clauses associated
with the same example, relative to equivalent instances of
information equivalent schemas. Therefore, in order for ProGolem to
be schema independent, we must show that the armg operator is
schema independent.

ProGolem considers bottom clauses as ordered clauses. This
means that the order of the bottom clauses can have an impact on
the result of the algorithm. Therefore, to ensure that the algorithm
is schema independent, we must force clauses to have an equivalent
order. Let $\tau : \mathcal{R} \to \mathcal{S}$ be a composition/decomposition. Let $I$
and $J$ be instances of $\mathcal{R}$ and $\mathcal{S}$, respectively, such that $\tau(I) = J$.
Let $\bot_{e,I}$ and $\bot_{e,J}$ be the bottom clauses associated with example
$e$ relative to $I$ and $J$, respectively. Then, $\bot_{e,I}$ and $\bot_{e,J}$ have
an equivalent order if, for $R_i, R_j \in \mathcal{R}$, the relations in
the inclusion class of $R_i$ appear before the relations in the inclusion
class of $R_j$ in $\bot_{e,I}$ iff the relations in the inclusion class of
$\tau(R_i)$ appear before the relations in the inclusion class of $\tau(R_j)$ in
$\bot_{e,J}$. Note that the order of relations within the inclusion classes
of $\tau(R_i)$ and $\tau(R_j)$ in $\bot_{e,J}$ does not matter as long as all relations
in $\tau(R_i)$ appear before relations in $\tau(R_j)$. Therefore, this order is
a fixed partial order between relations in $\mathcal{R}$.

One may use the content of $I$ to establish a partial order between
inclusion classes in $\mathcal{R}$, which is preserved under
composition/decomposition. Let us define the natural join over an
inclusion class in schema $\mathcal{R}$ as the join of all relations
in the class using their attributes that appear in INDs with
equality. According to Definition 4.1, $\tau$ does not join
relations from different inclusion classes in $\mathcal{R}$. It also
introduces a new IND with equality only when it decomposes a
relation in $\mathcal{R}$. Further, it eliminates INDs with equality
from $\mathcal{R}$ only if it joins some relations within an
inclusion class in $\mathcal{R}$. Hence, we may define a bijective
mapping $M$ between all inclusion classes in $\mathcal{R}$ and
$\mathcal{S}$ such that the natural join over inclusion class $L$ in
$\mathcal{R}$ is equal to the natural join of $M(L)$ in
$\mathcal{S}$ for all corresponding instances of $\mathcal{R}$ and
$\mathcal{S}$. Hence, one may use the natural joins of inclusion
classes to define an order between inclusion classes in a database,
which is preserved over all compositions/decompositions of the
database. In the rest of this section, we assume that equivalent
bottom clauses have an equivalent order.
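The invariance of the natural join over an inclusion class can be checked on a small instance. The following is a minimal sketch, not part of the original algorithms: the relation contents, the `natural_join` helper, and the `canonical` form are illustrative assumptions. It joins a vertically decomposed UW-CSE style class and compares the result with the composed relation.

```python
from functools import reduce


def natural_join(left, right):
    """Natural join of two relations given as (attribute tuple, set of rows)."""
    (attrs_l, rows_l), (attrs_r, rows_r) = left, right
    shared = [a for a in attrs_l if a in attrs_r]
    extra = [a for a in attrs_r if a not in attrs_l]
    joined = set()
    for row_l in rows_l:
        values_l = dict(zip(attrs_l, row_l))
        for row_r in rows_r:
            values_r = dict(zip(attrs_r, row_r))
            if all(values_l[a] == values_r[a] for a in shared):
                joined.add(row_l + tuple(values_r[a] for a in extra))
    return attrs_l + tuple(extra), joined


def canonical(relation):
    """Attribute-order-insensitive view of a relation, usable for comparison."""
    attrs, rows = relation
    return frozenset(frozenset(zip(attrs, row)) for row in rows)


# Hypothetical instance: one composed relation and its vertical decomposition.
composed = (("stud", "phase", "years"),
            {("ann", "pre_quals", "2"), ("bob", "post_quals", "5")})
decomposed = [
    (("stud",), {("ann",), ("bob",)}),
    (("stud", "phase"), {("ann", "pre_quals"), ("bob", "post_quals")}),
    (("stud", "years"), {("ann", "2"), ("bob", "5")}),
]

joined = reduce(natural_join, decomposed)
same_join = canonical(joined) == canonical(composed)
```

Because both schemas yield the same canonical join, any order derived from that join (for example, by sorting the canonical forms) would be the same under both schemas.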

The details for the armg operator are given in [33]. The
cardinality of the output clause from the armg operator is bounded
by the cardinality of the bottom clause used to construct it. This
clause will cover the seed example, as well as other examples
selected in the beam search. Therefore, employing the armg operator
is significantly more efficient than the rlgg operator, as resulting
clauses grow polynomially, instead of exponentially, with the number
of examples. The algorithm for constructing ARMGs is given in
Algorithm 3. Given the bottom clause associated with a positive
example, the algorithm constructs the ARMG by dropping literals from
the body of the clause until another positive example is covered.
The dropped literals are called blocking atoms [33]. The following
definition employs the $\vdash$ operator, where $x \vdash y$ means
that $y$ is provable from $x$.

DEFINITION 6.7. Let $B$ be background knowledge, $E^+$ the set
of positive examples, $e \in E^+$, and
$\vec{C} = T \leftarrow L_1, \ldots, L_n$ be a definite ordered
clause. $L_i$ is a blocking atom if and only if $i$ is the least
value such that for all $\theta$ with $e = T\theta$,
$B \nvdash (L_1, \ldots, L_i)\theta$.


Algorithm 3: ARMG algorithm.

Input : Bottom clause $\bot_e$, positive example $e'$.
Output: An ARMG in $armg_{\bot}(e' \mid e)$.

$\vec{C}$ is $\bot_e = T \leftarrow L_1, \ldots, L_n$;
while there is a blocking atom $L_i$ w.r.t. $e'$ in the body of $\vec{C}$ do
    Remove $L_i$ from $\vec{C}$;
    Remove atoms from $\vec{C}$ which are not head-connected;
end
Return $\vec{C}$;


THEOREM 6.8. The armg operator is schema independent w.r.t.
composition/decomposition.
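The ARMG construction can be sketched in a few lines. The code below is an illustrative simplification rather than the ProGolem implementation: it assumes function-free atoms written as `(predicate, args)` tuples, treats strings starting with an uppercase letter as variables, uses a set of ground facts in place of the background knowledge, and defines head-connectedness purely through shared variables (ProGolem itself relies on mode declarations). The example facts are hypothetical.

```python
def is_var(term):
    """Variables are strings starting with an uppercase letter (assumption)."""
    return isinstance(term, str) and term[:1].isupper()


def match(atom, fact, theta):
    """Extend substitution theta so that atom*theta == fact, else None."""
    if atom[0] != fact[0] or len(atom[1]) != len(fact[1]):
        return None
    theta = dict(theta)
    for term, const in zip(atom[1], fact[1]):
        if is_var(term):
            if theta.setdefault(term, const) != const:
                return None
        elif term != const:
            return None
    return theta


def satisfiable(body, facts, theta):
    """True if some extension of theta maps every body atom to a fact."""
    if not body:
        return True
    for fact in facts:
        extended = match(body[0], fact, theta)
        if extended is not None and satisfiable(body[1:], facts, extended):
            return True
    return False


def blocking_index(head, body, example, facts):
    """Index of the first blocking atom w.r.t. example, or None."""
    theta = match(head, example, {})
    if theta is None:
        raise ValueError("example does not match the clause head")
    for i in range(1, len(body) + 1):
        if not satisfiable(body[:i], facts, theta):
            return i - 1
    return None


def head_connected(head, body):
    """Keep atoms linked to the head through a chain of shared variables."""
    reachable = {t for t in head[1] if is_var(t)}
    kept, changed = set(), True
    while changed:
        changed = False
        for k, (_, args) in enumerate(body):
            if k not in kept and reachable & set(args):
                kept.add(k)
                reachable |= {t for t in args if is_var(t)}
                changed = True
    return [atom for k, atom in enumerate(body) if k in kept]


def armg(head, body, example, facts):
    """Drop blocking atoms (and disconnected atoms) until example is covered."""
    body = list(body)
    while (i := blocking_index(head, body, example, facts)) is not None:
        del body[i]
        body = head_connected(head, body)
    return body


head = ("advisedBy", ("S", "P"))
bottom_body = [("student", ("S",)), ("professor", ("P",)),
               ("publication", ("T", "S")), ("publication", ("T", "P")),
               ("ta", ("C", "S", "Term"))]
facts = {("student", ("carol",)), ("professor", ("dan",)),
         ("publication", ("t2", "carol")), ("publication", ("t2", "dan"))}
result = armg(head, bottom_body, ("advisedBy", ("carol", "dan")), facts)
```

In this toy run the only blocking atom is the `ta` literal, since the second example has no matching `ta` fact; the remaining four literals cover both examples.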

7. QUERY-BASED ALGORITHMS 

In this section, we consider query-based learning algorithms,
which learn exact definitions by asking queries to an oracle
[25][40][43][6][3]. These types of algorithms have recently been
used in various areas of database management, such as finding schema
mappings and designing usable query interfaces [10][3]. Queries can
be of multiple types; the most common types are equivalence queries
and membership queries. In an equivalence query (EQ), the learner
presents a hypothesis to the oracle, and the oracle returns yes if
the hypothesis is equivalent to the target relation definition;
otherwise it returns a counter-example. In a membership query (MQ),
the learner asks whether an example is a positive example, and the
oracle answers yes or no. Query-based algorithms are theoretically
evaluated by their query complexity, that is, the asymptotic number
of queries asked by the algorithm. In particular, we evaluate the
lower and upper bounds on the query complexity of these algorithms.
A good query-based algorithm asks few queries, since every
additional query consumes more resources. For instance, in the query
interface described in [3], the oracle is the user. Therefore,
asking the user fewer questions makes the interface more usable.
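The two query types and the notion of query complexity can be illustrated with a toy oracle. This is a minimal sketch under strong assumptions: the target relation is given extensionally as a finite set of tuples, and `naive_learner` is a hypothetical learner written for illustration only; it is not the A2 algorithm.

```python
class Oracle:
    """Answers membership (MQ) and equivalence (EQ) queries and counts them."""

    def __init__(self, target):
        self.target = frozenset(target)
        self.mq_count = 0
        self.eq_count = 0

    def membership(self, example):
        """MQ: is the example a positive example of the target relation?"""
        self.mq_count += 1
        return example in self.target

    def equivalence(self, hypothesis):
        """EQ: return None for 'yes', otherwise a counter-example."""
        self.eq_count += 1
        difference = self.target ^ frozenset(hypothesis)
        return min(difference) if difference else None


def naive_learner(oracle):
    """Repair the hypothesis one counter-example at a time."""
    hypothesis = set()
    while (counterexample := oracle.equivalence(hypothesis)) is not None:
        if oracle.membership(counterexample):
            hypothesis.add(counterexample)
        else:
            hypothesis.discard(counterexample)
    return hypothesis


oracle = Oracle({("ann", "bob"), ("carol", "dan")})
learned = naive_learner(oracle)
```

Here the learner needs three EQs and two MQs; the query complexity of a real algorithm is the asymptotic growth of these counters.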

In this paper, we focus on the A2 algorithm by Khardon [25], a
query-based learning algorithm that learns function-free, first-order
Horn expressions. The reasons for choosing this algorithm are
threefold: i) A2 is representative of query-based learning
algorithms that work on the relational model, ii) there is an
implementation of the algorithm [6], and iii) A2 is a generalization
to the relational model of a classic query-based propositional
algorithm [5].

Because query-based algorithms follow a different learning model, 
in this section we follow a different approach by analyzing the im¬ 
pact of schema transformations on the query complexity of learn¬ 
ing algorithms. If a query-based algorithm is schema independent, 
it should be able to learn exact definitions with asymptotically the 
same number of queries under equivalent schemas. For this pur¬ 
pose, we compare the lower bound on the query complexity of these 
algorithms against the upper bounds on their query complexities 
over their composition/decomposition. We argue that if the lower
bound under one of the schemas is greater than the upper bound
under another schema, then the algorithm is not schema independent.
Of course, this is not a desirable property, as it means that the
choice of representation has a large impact on the performance of
the algorithm. However, we prove that algorithms such as A2 suffer
from this property.

Original                            4NF
student(stud)                       student(stud,phase,years)
inPhase(stud,phase)                 professor(prof,position)
yearsInProgram(stud,years)          courselevel(crs,level)
professor(prof)                     taughtby(crs,prof,term)
hasPosition(prof,position)          ta(crs,stud,term)
courselevel(crs,level)              publicationProfessor(title,prof)
taughtby(crs,prof,term)             publicationStudent(title,stud)
ta(crs,stud,term)
publicationProfessor(title,prof)
publicationStudent(title,stud)

Denorm. 1                           Denorm. 2
student(stud,phase,years,title)     student(stud,phase,years)
professor(prof,position,title)      professor(prof,position)
course(crs,prof,stud,term,level)    course(crs,prof,stud,term,level)
                                    coauthor(title,prof,stud)

Table 3: Schemas for the UW-CSE DB with primary keys underlined.

THEOREM 7.1. Let $\Omega(f)_{\mathcal{R}}$ and
$O(g)_{\mathcal{R}}$ be the lower bound and upper bound,
respectively, on the query complexity of A2 for all target relations
under schema $\mathcal{R}$. Then, there is a
composition/decomposition of $\mathcal{R}$, $\mathcal{S}$, such that
$\Omega(f)_{\mathcal{R}} > O(g)_{\mathcal{S}}$.

The lower bound of A2 is actually the Vapnik-Chervonenkis
dimension (VC-Dim) of the hypothesis language that consists of
function-free, first-order Horn expressions. Therefore, we have
proven in Theorem 7.1 that there are cases where the lower bound on
the query complexity of any algorithm under this hypothesis language
over one schema is greater than the upper bound on the query
complexity of A2 over an equivalent schema. This means that any
algorithm that is as good as A2 (does not ask more queries than A2)
is highly dependent on the schema details.

In query-based algorithms, the running time is dependent on the
number of queries asked to the oracle. For instance, the running
time of the A2 algorithm is polynomial in the upper bound on the
query complexity and $n^k$ [25]. Parameters $n$ and $k$ are not
dependent on the schema. Therefore, as in other families of
algorithms, the running time of A2 is exponential in the maximum
arity and linear in the number of relations. This results in the
algorithm taking longer on schemas that contain relations with large
arity.

8. EMPIRICAL RESULTS 
8.1 Sample-based algorithms 

We evaluate the average-case schema independence of three popular
relational learning algorithms: FOIL [38], Progol [30], and
ProGolem [33]. Progol is a top-down algorithm similar to FOIL, but
performs a non-greedy search over the hypothesis space, which is
restricted by some parameters. We emulate both FOIL and Progol using
Aleph², a well-known Inductive Logic Programming (ILP) system. We
use the names A-FOIL and A-Progol to indicate the results of these
systems. ProGolem is implemented in GILPS³, another ILP system. The
configurations of the Aleph and GILPS systems are in Appendix B.

We have used the UW-CSE and IMDb⁴ DBs for our experiments. We
represent the UW-CSE DB using four equivalent schemas: the original
schema, its transformed 4th normal form, and two denormalized
schemas, which are all shown in Table 3. The schemas contain the
required FDs and INDs explained in Section 4. For the UW-CSE
dataset, we use 939 tuples and 46 positive examples. We generate
negative examples using the closed-world assumption, and then sample
these to obtain twice as many negative examples as positive
examples. We learn the relation advisedBy(stud,prof), which
indicates that student stud is advised by professor prof.

We also run experiments using a subset of the IMDb. We obtain
movies that were high grossing in their opening weekend by querying
the IMDb website. We then obtain information such as actors,
directors, genres, countries, etc. for each movie from the JMDB
database⁵, which contains the same information found in the IMDb
website, in relational format. We use a sample of 249160 tuples
from JMDB. We represent IMDb using two schemas: the original JMDB
schema and an alternative schema called “Single Lookup”, where all
attributes shown in the summary box of each movie in the IMDb
website are in a single relation. These schemas are shown in
Table 4. We hypothesize that to improve the response time of showing
the summary information about a movie, a database designer may
compose all the attributes of the summary box in one relation to
avoid expensive joins. We leave other relations in the original JMDB
schema intact. We enforce the required IND constraints on the
original schema to make the transformation an information preserving
composition/decomposition. We learn the relation
femaleActor(person), which indicates that person is female. We
sample the training data and obtain 250 positive examples and 500
negative examples.

² http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html
³ http://www.doc.ic.ac.uk/~jcs06/GILPS/
⁴ http://www.imdb.com
⁵ http://www.jmdb.de

Original                              Single Lookup
movies(id,title,year)                 movies(id,title,year,genre,
genres(id,genre)                        rank,votes,certcountry,cert,
ratings(id,rank,votes)                  directorid,directorname,
certificates(id,country,cert)           writerid,writername)
directors(directorid,directorname)
movies2directors(id,directorid)
writers(writerid,writername)
movies2writers(id,writerid)

Common relations
keywords(id,keyword)      plots(id,plot)              countries(id,country)
business(id,text)         altversions(id,version)     colorinfo(id,color)
runningtimes(id,times)    prodcompanies(id,name)
actors(actorid,actorname)           movies2actors(id,actorid,character)
producers(producerid,producername)  movies2producers(id,producerid)

Table 4: Schemas for the IMDb. Relations at the bottom are contained
in both schemas.

Algorithm   Metric      Original   4NF     Denorm. 1   Denorm. 2
A-FOIL      Precision   0.91       0.40    0.41        0.61
            Recall      0.73       0.44    0.55        0.92
            Time (s)    1.33       1.48    14.86       1.78
A-Progol    Precision   0.84       0.56    0.49        0.86
            Recall      0.89       0.73    0.40        0.92
            Time (s)    3.39       2.84    21.10       3.17
ProGolem    Precision   0.86       0.86    0.86        0.86
            Recall      0.92       0.92    0.92        0.92
            Time (s)    9.82       9.66    23.10       9.66

Table 5: Results of learning relations over UW-CSE.

Algorithm    Metric      Original   Single Lookup
A-FOIL       Precision   0.63       0.60
             Recall      0.43       0.44
             Time (m)    0.88       20.49
A-Progol     Precision   0.61       0.59
             Recall      0.42       0.44
             Time (m)    5.91       24.55
ProGolem     Precision   0.89       0.98
             Recall      0.34       0.30
             Time (m)    5.21       79.47
M-ProGolem   Precision   0.98
             Recall      0.30
             Time (m)    59.34

Table 6: Results of learning relations over IMDb. The M-ProGolem
rows report a single value per metric.

                         Max vars
Algorithm    Metric      5       10      20       40
M-ProGolem   Precision   0.86    0.86    0.86     0.86
             Recall      0.92    0.92    0.91     0.91
             Time (s)    36.43   37.81   102.75   481.66

Table 7: Results of learning relations over UW-CSE with modified
ProGolem.

We evaluate the precision, recall, and running time of the
algorithms as the average over 5-fold cross validation. Precision is
the proportion of true positives among all examples classified as
positive, and recall is the proportion of true positives among all
positive examples. Tables 5 and 6 show the results for the UW-CSE
and IMDb DBs, respectively. These results show that the algorithms
are not generally schema independent, as the effectiveness measures
vary quite widely across different schemas. ProGolem appears to be
more robust than the others. For instance, ProGolem generates
equivalent definitions across all schemas of UW-CSE. On the other
hand, A-FOIL and A-Progol show large differences in precision and
recall across different schemas for both datasets. ProGolem is more
robust because it relies less on heuristic guidance than A-FOIL and
A-Progol. ProGolem is also more effective than A-FOIL and A-Progol
over UW-CSE. It is also significantly more precise on IMDb than the
top-down algorithms, but has a lower recall. Our hypothesis is that
this happens because bottom-up algorithms are based on
generalization operators that take bigger search steps.
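The precision and recall measures used in this section can be computed directly from the set of examples a learned definition classifies as positive. The following is a minimal sketch; the function name and the example sets are illustrative assumptions and do not come from the experimental setup.

```python
def precision_recall(predicted, positives):
    """Precision and recall of a set of predicted-positive examples."""
    predicted, positives = set(predicted), set(positives)
    true_positives = len(predicted & positives)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(positives) if positives else 0.0
    return precision, recall


# Hypothetical fold: 4 examples covered by the definition, 3 true positives.
covered = {("ann", "bob"), ("carol", "dan"), ("eve", "bob"), ("gus", "dan")}
labeled_positive = {("ann", "bob"), ("carol", "dan"), ("hal", "ida")}
precision, recall = precision_recall(covered, labeled_positive)
```

Averaging these two values over the five folds gives the kind of figures reported in Tables 5 and 6.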

A particularly notable observation is that the performance of all 
learning algorithms is decreased when using the Denormalized 1 
schema of the UW-CSE DB and the Single Lookup schema of the 
IMDb. In A-FOIL and A-Progol, using these schemas results in very
low precision and recall. However, this is not the case for schema
Denormalized 2. This is because the Denormalized 2 schema contains
the relation coauthor(title,prof,stud), which is highly predictive
for the target relation. Finally, we can see that the runtime
of all learning algorithms is increased when using the Denormal¬ 
ized 1 schema and the Single Lookup schema. Such an increase in 
runtime would likely discourage users who use more denormalized 
schemas, unless they have prior knowledge of which relations can 
be predictive of the target relation. 

In order to achieve schema independence, we have modified the
bottom clause construction algorithm of ProGolem, as described in
Section 6. We call this algorithm M-ProGolem. We added a new
parameter called maxvars. In this algorithm, in iteration i all
literals that contain variables of depth at most i are added to the
bottom clause. With the partial order assumption, this algorithm is
schema independent. Therefore, all schemas obtain the same precision
and recall, and only their running time varies. It may not be clear
to the user how to set the maxvars parameter. One approach is to
experiment with a few reasonable values and evaluate them in terms
of accuracy and running time. Table 7 shows the results of running
M-ProGolem over UW-CSE for different values of maxvars. We observe
that the output of the algorithm is generally not sensitive to a
particular choice of maxvars. We set the maxvars parameter to 10 for
IMDb. The results of this experiment are shown in the last row of
Table 6. As can be seen, M-ProGolem is at least as precise as
ProGolem over both datasets. It also has the same recall as ProGolem
in almost all cases over both datasets.

8.2 Query-based algorithms 

We use the LogAn-H system [6], which is an implementation of the
A2 algorithm [25]. Specifically, we use the interactive algorithm
with automatic user mode. In this mode, the system is told the Horn
definition to be learned, so that it can act as an oracle. Then the
algorithm's queries are answered automatically until it learns the
exact definition. We generated random Horn definitions over the
alternative schema of the UW-CSE DB, shown in Table 1. The only
parameter for generating each clause in a definition is the number
of variables in the clause. To generate the head of each clause, we
created a new relation of random arity, where the minimum arity is 1
and the maximum arity is the maximum arity of the relations in the
alternative schema. The body of each clause can be of any length as
long as the number of variables in the clause is equal to the
specified parameter and all variables appearing in the head relation
also appear in some relation in the body. The body of the clause is
composed of randomly chosen relations, where each relation can be
the head relation (allowing for recursive clauses) or any relation
in the input schema. Head and body relations are populated with
variables, where each variable is randomly chosen to be a new
variable (until reaching the input number of variables) or an
already used variable. Clauses cannot contain function or constant
symbols.
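The clause generation procedure described above can be sketched as follows. This is an illustrative reading of the description, not the script used in the experiments: the function name `random_clause`, the head relation name `target`, the 50% chance of introducing a fresh variable, and the example schema arities are all assumptions.

```python
import random


def random_clause(schema, num_vars, rng, head_name="target"):
    """Generate one random function-free Horn clause with num_vars variables.

    schema maps relation names to arities; atoms are (name, args) tuples.
    """
    max_arity = max(schema.values())
    head_arity = rng.randint(1, max_arity)
    used = []

    def pick_var():
        # Introduce a fresh variable until num_vars is reached, else reuse one.
        if len(used) < num_vars and (not used or rng.random() < 0.5):
            used.append(f"X{len(used)}")
            return used[-1]
        return rng.choice(used)

    head = (head_name, tuple(pick_var() for _ in range(head_arity)))
    # Body relations may include the head relation (recursive clauses).
    relations = dict(schema, **{head_name: head_arity})
    body = []
    while True:
        name = rng.choice(sorted(relations))
        body.append((name, tuple(pick_var() for _ in range(relations[name]))))
        body_vars = {v for _, args in body for v in args}
        # Stop once the clause has num_vars variables and the head is covered.
        if len(used) == num_vars and set(head[1]) <= body_vars:
            return head, body


example_schema = {"student": 3, "professor": 2, "course": 5, "coauthor": 3}
head, body = random_clause(example_schema, 5, random.Random(0))
```

Fixing the seed of `random.Random` makes a generated workload reproducible, which matters when the same definitions must be replayed over two schemas.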

After generating each random Horn definition over the alternative
schema, we transformed these expressions to the original schema by
applying vertical decomposition to each of the clauses in an
expression. Then, we minimized all definitions using the
Homomorphism theorem and the Chase algorithm with the functional
and inclusion dependency constraints [2]. We varied the number of
clauses in a definition to be between 1 and 5, each containing
between 4 and 8 variables. We generated 50 random definitions for
each setting, getting a total of 250 expressions for each number of
variables. The A2 algorithm takes as input the target expression
and the signature. The signature consists of the names of all
relations in the input schema and the head relation, as well as the
arity of each relation. We ran the LogAn-H system with the original
definition over the alternative schema and the transformed
definition over the original schema, and recorded the number of
queries required to learn each definition. In these experiments, we
report the query complexity, that is, the number of equivalence
queries (EQs) and membership queries (MQs), of the A2 algorithm.

The number of EQs and MQs asked by the algorithm under the
original and alternative schemas is presented in Figure 2. The
average number of EQs required by the A2 algorithm over both schemas
is constant for different numbers of variables. However, this is not
the case for MQs. In particular, we can see that the number of MQs
increases with the more decomposed schema, that is, the original
schema.

Figure 2: Average number of membership queries required by the A2
algorithm.

9. CONCLUSION 

We formally defined the novel property of schema independence
for relational learning algorithms, which states that the output of
these algorithms should not depend on the schema used to represent
their input databases. We proved that current popular relational
learning algorithms are not schema independent. We used the
dependencies in the schema to extend current bottom-up learning
algorithms and proved that the resulting algorithms are schema
independent. Our empirical results on benchmark and real datasets
validated our theoretical results and showed that the proposed
algorithms are as effective as the current relational learning
algorithms.

APPENDIX
A. PROOFS

Proof of Theorem 5.1

PROOF. Let $\tau : \mathcal{R} \to \mathcal{S}$ be a
composition/decomposition. Without loss of generality, we assume
that (distinct) relations $R_1(A, B, C)$ and $R_2(D, B, E)$ belong
to schema $\mathcal{R}$. We also assume that $\tau$ decomposes $R_1$
to $S_1(A, B)$ and $S_2(B, C)$, and $R_2$ to $S_3(D, B)$ and
$S_4(B, E)$, respectively. Let $l$ be the maximum clause length and
$\theta = \langle l \rangle$ be the parameter settings for FOIL.
Without loss of generality we set the value of $l$ to 2. Let
$\mathcal{L}_{\mathcal{R},\theta}$ and
$\mathcal{L}_{\mathcal{S},\theta}$ be the hypothesis languages over
$\mathcal{R}$ and $\mathcal{S}$, respectively. Let $T(X, Y)$ be the
target relation. Consider hypothesis $h_{\mathcal{R}}$:

$$T(X, Y) \leftarrow R_1(X, Z, W), R_2(Y, Z, V).$$

over schema $\mathcal{R}$. The mapped hypothesis
$\delta_\tau(h_{\mathcal{R}})$ is:

$$T(X, Y) \leftarrow S_1(X, Z), S_2(Z, W), S_3(Y, Z), S_4(Z, V).$$

Because $h_{\mathcal{R}}$ is minimal and all literals in the body of
the clause in $\delta_\tau(h_{\mathcal{R}})$ are different,
$\delta_\tau(h_{\mathcal{R}})$ is also minimal. Hypothesis
$h_{\mathcal{R}}$ is in the hypothesis language
$\mathcal{L}_{\mathcal{R},\theta}$ because
$clauselength(h_{\mathcal{R}}) \le 2$. However, hypothesis
$\delta_\tau(h_{\mathcal{R}})$ is not in the hypothesis language
$\mathcal{L}_{\mathcal{S},\theta}$ because
$clauselength(\delta_\tau(h_{\mathcal{R}})) > 2$. There may be other
definitions semantically equivalent to
$\delta_\tau(h_{\mathcal{R}})$; however, they will have clause
length greater than or equal to
$clauselength(\delta_\tau(h_{\mathcal{R}}))$. Thus, they will not be
in $\mathcal{L}_{\mathcal{S},\theta}$ either. Therefore, hypothesis
spaces $\mathcal{L}_{\mathcal{R},\theta}$ and
$\mathcal{L}_{\mathcal{S},\theta}$ are not equivalent. Now, let us
use another parameter setting $\theta'$ for schema $\mathcal{S}$
where the value of $l$ is 4, so that the hypothesis
$\delta_\tau(h_{\mathcal{R}})$ becomes a member of
$\mathcal{L}_{\mathcal{S},\theta'}$. In this case, the following
hypothesis will also be a member of
$\mathcal{L}_{\mathcal{S},\theta'}$:

$$T(X, Y) \leftarrow S_1(X, Z), S_1(X, W), S_1(X, T), S_1(X, Y).$$

However, the equivalent hypothesis to this hypothesis over
$\mathcal{R}$ is:

$$T(X, Y) \leftarrow R_1(X, Z, V_1), R_1(X, W, V_2), R_1(X, T, V_3), R_1(X, Y, V_4).$$

where $V_i$, $1 \le i \le 4$, are fresh variables. Since this
hypothesis over $\mathcal{R}$ is minimal, one has to change $l$ over
$\mathcal{R}$ to 4 to achieve equivalent hypothesis spaces over
$\mathcal{R}$ and $\mathcal{S}$. Hence, we have to alternate between
the parameter settings over $\mathcal{R}$ and $\mathcal{S}$ without
any stopping condition. Thus, there are no fixed values for the
parameters that ensure hypothesis equivalence over schemas
$\mathcal{R}$ and $\mathcal{S}$. □
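The first step of this argument can be replayed mechanically. The sketch below is illustrative only: the `TAU` table encodes the decomposition assumed in the proof by attribute position, and `map_clause` and `in_language` are hypothetical helpers, not part of FOIL. It maps the body of the hypothesis over the composed schema to the decomposed schema and compares clause lengths against the bound l = 2.

```python
# Decomposition assumed in the proof, written by attribute position:
# R1(A,B,C) -> S1(A,B), S2(B,C) and R2(D,B,E) -> S3(D,B), S4(B,E).
TAU = {
    "R1": [("S1", (0, 1)), ("S2", (1, 2))],
    "R2": [("S3", (0, 1)), ("S4", (1, 2))],
}


def map_clause(body):
    """Rewrite each literal over the composed schema into its projections."""
    mapped = []
    for name, args in body:
        for new_name, positions in TAU[name]:
            literal = (new_name, tuple(args[p] for p in positions))
            if literal not in mapped:  # drop syntactic duplicates
                mapped.append(literal)
    return mapped


def in_language(body, max_length):
    """Clause-length restriction: at most max_length body literals."""
    return len(body) <= max_length


h_r = [("R1", ("X", "Z", "W")), ("R2", ("Y", "Z", "V"))]
h_s = map_clause(h_r)
```

With l = 2 the original clause is admitted and its image is not, which is the mismatch between the two hypothesis languages that the proof exploits.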

Proof of Proposition 5.2

PROOF. Let $\tau : \mathcal{R} \to \mathcal{S}$ be a
composition/decomposition. Without loss of generality, let $R_1$ and
$R_2$ be relations in $\mathcal{R}$ such that either $R_1$ or $R_2$
participates in only one IND $R_1[A] = R_2[B]$. $R_1$ and $R_2$ will
always appear together in all clauses of the hypothesis space of
modified FOIL. Without loss of generality, we set the maximum number
of inclusion classes for modified FOIL over $\mathcal{R}$ to 1. Let
$c$ be the set of clauses that contain a single occurrence of $R_1$
and a single occurrence of $R_2$. The set $c$ is a subset of the
hypothesis space of modified FOIL over $\mathcal{R}$. Let us assume
that $\tau$ maps relation $R_1$ to a set of relations $S_1$ and
$R_2$ to a relation $S_2 \notin S_1$. There is an IND with equality
between every pair of relations in $S_1$. Further, because
composition/decomposition preserves INDs, there is an IND with
equality between $S_1$ and $S_2$. Hence, all clauses in the
hypothesis space of modified FOIL over $\mathcal{S}$ that contain a
relation in $S_1$ will also contain the rest of the relations in
$S_1$ and $S_2$. We call this set of clauses $c'$. Because
compositions/decompositions do not introduce any new IND with
equality, these clauses do not contain any other relation and
satisfy the restriction on the maximum number of inclusion classes.
Because the same variables will be assigned to the attributes that
participate in the INDs with equality, there is a bijection $M$
between the clauses in $c$ and $c'$ such that every clause in $c$ is
equivalent to its image under $M$. The proof extends to all
definitions built over these clauses. □

Proof of Lemma 6.4

PROOF. Assume that $\bot_{e,I} \not\equiv \bot_{e,J}$. Then, there
is a literal $L$ in $\bot_{e,I}$, but the literals in $\tau(L)$ are
not in $\bot_{e,J}$. Assume this literal has relation name $R$,
where $R$ is a relation in $\mathcal{R}$ and
$\tau(R) = S_1, \ldots, S_m$. We denote literal $L$ simply by $R$
and the literals in $\tau(L)$ simply by $S_1, \ldots, S_m$.
According to our assumption, none of $S_1, \ldots, S_m$ are in
$\bot_{e,J}$. This means that the variables in $\bot_{e,J}$ were
exhausted in other literals and the maxvars value was reached. We
denote the set of these literals $L_S$. According to the algorithm,
all literals in $L_S$ have depth lower than any literal in
$S_1, \ldots, S_m$. If this was not the case, at least one of
$S_1, \ldots, S_m$ would be in $\bot_{e,J}$. Then, all literals in
$\tau^{-1}(L_S)$ must have depth lower than $R$, because we are
using the same example and equivalent DB instances $I$ and $J$.
Therefore, all literals in $\tau^{-1}(L_S)$ are added to
$\bot_{e,I}$ before $R$. If the algorithm was able to add literal
$R$ to $\bot_{e,I}$, then the algorithm should be able to add at
least one of $S_1, \ldots, S_m$ to $\bot_{e,J}$. Assume the added
literal is $S_1$. The algorithm then applies the chase procedure,
which adds the literals $S_2, \ldots, S_m$ to $\bot_{e,J}$.
Therefore, we have a contradiction. □

Proof of Lemma 6.5

PROOF. Let $\tau : \mathcal{R} \to \mathcal{S}$ be a bijective
transformation that is a vertical composition/decomposition between
schemas $\mathcal{R} = (\mathbf{R}, \Sigma_{\mathcal{R}})$ and
$\mathcal{S} = (\mathbf{S}, \Sigma_{\mathcal{S}})$. Let $I$ and $J$
be instances of $\mathcal{R}$ and $\mathcal{S}$, respectively, such
that $\tau(I) = J$. Let $T$ be the target relation, and
$e_1 = T(a_1, \ldots, a_l)$ and $e_2 = T(b_1, \ldots, b_l)$ be two
positive examples. Let $(e_1 \leftarrow I'_1)$ and
$(e_2 \leftarrow I'_2)$ be the saturations under schema
$\mathcal{R}$ for $e_1$ and $e_2$, respectively, such that
$I'_1, I'_2 \subseteq I$. Similarly, let $(e_1 \leftarrow J'_1)$ and
$(e_2 \leftarrow J'_2)$ be the saturations under schema
$\mathcal{S}$ for $e_1$ and $e_2$, respectively, such that
$J'_1, J'_2 \subseteq J$.

We show that the result of the rlgg operator for examples $e_1$ and
$e_2$ is equivalent under schemas $\mathcal{R}$ and $\mathcal{S}$.
That is

$$rlgg_{\mathcal{R}}(e_1, e_2) = rlgg_{\mathcal{S}}(e_1, e_2)$$

$$lgg((e_1 \leftarrow I'_1), (e_2 \leftarrow I'_2)) = lgg((e_1 \leftarrow J'_1), (e_2 \leftarrow J'_2))$$

We know that $(e_1 \leftarrow I'_1)$ and $(e_2 \leftarrow I'_2)$ are
clauses. Therefore,
$lgg((e_1 \leftarrow I'_1), (e_2 \leftarrow I'_2))$ is the set of
pairwise lgg operations of compatible ground atoms in
$(e_1 \leftarrow I'_1)$ and $(e_2 \leftarrow I'_2)$. Two atoms are
compatible if they have the same relation name. We show that the lgg
of compatible ground atoms under schema $\mathcal{R}$ delivers
equivalent results under schema $\mathcal{S}$.

Let $R \in \mathbf{R}$ be a relation in $\mathcal{R}$ such that
$\tau(R) = S_1, \ldots, S_m$, $1 \le m \le |\mathbf{S}|$. Because of
Corollary 4.3.2 in [36], we know that if $\tau$ is bijective,
$\Sigma_{\mathcal{S}}$ contains inclusion dependencies between the
join attributes of $S_1, \ldots, S_m$. Let
$r_1 = R(a_1, \ldots, a_k)$ and $r_2 = R(a'_1, \ldots, a'_k)$ be two
ground atoms in $I$. Then,
$\tau(r_1) = S_1(t_1), \ldots, S_m(t_m)$ and
$\tau(r_2) = S_1(t'_1), \ldots, S_m(t'_m)$ are ground atoms in $J$,
where $t_i$ and $t'_i$, $1 \le i \le m$, are tuples. Then, the lgg
of ground atoms $r_1$ and $r_2$ is defined as

$$lgg(r_1, r_2) = R(lgg(a_1, a'_1), \ldots, lgg(a_k, a'_k))$$

By applying transformation $\tau$, this is equivalent to

$$S_1(s_1), S_2(s_2), \ldots, S_m(s_m)$$

where $s_j$ is a tuple that contains a subset of the attributes in
$\{lgg(a_1, a'_1), \ldots, lgg(a_k, a'_k)\}$ for $1 \le j \le m$. By
definition of the lgg operator, we get

$$S_1(s_1), S_2(s_2), \ldots, S_m(s_m) = lgg(S_1(t_1), S_1(t'_1)), \ldots, lgg(S_m(t_m), S_m(t'_m)) = lgg(\tau(r_1), \tau(r_2))$$

□
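The commutation of lgg with vertical decomposition on ground atoms can be checked on a small case. This is a minimal sketch under stated assumptions: atoms are `(name, args)` tuples, differing constants are generalized to variables named by a lookup table keyed on the pair of constants, and `decompose` encodes a hypothetical transformation R(a,b,c) to S1(a,b), S2(b,c) chosen for illustration.

```python
def lgg_terms(a, b, table):
    """lgg of two constants: itself if equal, else a variable for the pair."""
    if a == b:
        return a
    return table.setdefault((a, b), f"V{len(table)}")


def lgg_atoms(x, y, table):
    """lgg of two compatible ground atoms, or None if incompatible."""
    (name_x, args_x), (name_y, args_y) = x, y
    if name_x != name_y or len(args_x) != len(args_y):
        return None
    return name_x, tuple(lgg_terms(a, b, table) for a, b in zip(args_x, args_y))


def decompose(atom):
    """Hypothetical vertical decomposition R(a,b,c) -> S1(a,b), S2(b,c)."""
    _, (a, b, c) = atom
    return [("S1", (a, b)), ("S2", (b, c))]


r1 = ("R", ("ann", "cs", "phd"))
r2 = ("R", ("bob", "cs", "msc"))

# Generalize first, then decompose.
generalized_then_mapped = decompose(lgg_atoms(r1, r2, {}))

# Decompose first, then generalize compatible atoms pairwise.
table = {}
mapped_then_generalized = [
    lgg_atoms(s, t, table) for s, t in zip(decompose(r1), decompose(r2))
]
```

Both routes yield the same pair of generalized atoms because the variable chosen for a pair of constants depends only on that pair, which mirrors the last step of the proof.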

Proof of Theorem 6.6

PROOF. Golem follows a covering approach. Any algorithm that
follows a covering approach is schema independent if its LearnClause
procedure is schema independent. Golem's LearnClause procedure
consists of finding the largest subset of saturated clauses whose
lgg covers no negative examples.

Let $\tau : \mathcal{R} \to \mathcal{S}$ be a bijective
transformation that is a vertical composition/decomposition between
schemas $\mathcal{R} = (\mathbf{R}, \Sigma_{\mathcal{R}})$ and
$\mathcal{S} = (\mathbf{S}, \Sigma_{\mathcal{S}})$. The algorithm
first finds a set of candidate clauses, which are rlggs between
pairs of examples. These clauses are picked greedily according to
coverage. We have shown in Proposition 6.3 that the rlgg operator is
schema independent w.r.t. vertical composition/decomposition
transformations. The coverage of equivalent clauses under
$\mathcal{R}$ and $\mathcal{S}$ must be the same, as it only depends
on the set of examples and not on the schema. Therefore, the set of
candidate clauses is the same under $\mathcal{R}$ and $\mathcal{S}$.

The algorithm then finds the clause with the best coverage. Again,
this depends only on the set of examples and not on the schema. It
then reduces this clause and finds other candidate clauses by
generalizing this clause with other positive examples. Because the
rlgg operator is schema independent and the set of positive examples
is the same under both schemas, the resulting set of candidate
clauses must be the same. This procedure iterates until there are no
more candidate clauses. We now show that negative reduction of two
equivalent clauses under equivalent schemas delivers equivalent
pruned clauses. Let $h_{\mathcal{R}}$ and $h_{\mathcal{S}}$ be
definitions for target relation $T$ under schemas $\mathcal{R}$ and
$\mathcal{S}$, respectively. Without loss of generality, assume that
$h_{\mathcal{R}}$ and $h_{\mathcal{S}}$ contain the single clauses
$C_{\mathcal{R}}$ and $C_{\mathcal{S}}$, respectively.

Let $C'_{\mathcal{R}}$ and $C'_{\mathcal{S}}$ be pruned clauses such
that $reduce(C_{\mathcal{R}}) = C'_{\mathcal{R}}$ and
$reduce(C_{\mathcal{S}}) = C'_{\mathcal{S}}$. We assume that
relations in $C_{\mathcal{R}}$ and $C_{\mathcal{S}}$ are considered
for removal in a fixed order. That is, if for relations
$R_1, R_2 \in \mathcal{R}$, $R_1$ is considered before $R_2$, then
relations $\tau(R_1)$ are considered before relations $\tau(R_2)$
under schema $\mathcal{S}$. We explain how to compute this order in
Section 6.3.

Assume that relation $R$ is in clause $C_{\mathcal{R}}$. Then,
because $C_{\mathcal{R}} \equiv C_{\mathcal{S}}$, relations
$S_1, \ldots, S_m$ must be in clause $C_{\mathcal{S}}$. We consider
two cases: 1) $R \in C'_{\mathcal{R}}$ or 2)
$R \notin C'_{\mathcal{R}}$. We show that if case 1 occurs, then
$S_1, \ldots, S_m \in C'_{\mathcal{S}}$, and if case 2 occurs, then
$S_1, \ldots, S_m \notin C'_{\mathcal{S}}$.

Case 1: Assume that some relation $S_i \in \{S_1, \ldots, S_m\}$ is
not in $C'_{\mathcal{S}}$. This means that $S_i$ was dropped without
making the clause inconsistent. However, we know that there are
inclusion dependencies in $\Sigma_{\mathcal{S}}$ between $S_i$ and
other relation(s) in $S_1, \ldots, S_m$ [36]. Therefore, we can get
an equivalent clause by applying
$C'_{\mathcal{S}} = chase_{\Sigma_{\mathcal{S}}}(C'_{\mathcal{S}})$,
which will contain $S_i$. Therefore, we get that
$C'_{\mathcal{R}} \equiv C'_{\mathcal{S}}$.

Case 2: Assume that some relation $S_i \in \{S_1, \ldots, S_m\}$ is
in $C'_{\mathcal{S}}$. If we were able to drop relation $R$ from
$C_{\mathcal{R}}$ without making it inconsistent, then, because
$\tau(R) = S_1, \ldots, S_m$, we should be able to drop all
relations $S_1, \ldots, S_m$ without making $C_{\mathcal{S}}$
inconsistent. Therefore, we have that
$C'_{\mathcal{R}} \equiv C'_{\mathcal{S}}$. □

Proof of Theorem l6.8l 

PROOF. The armg operator is computed by the ARMG algo¬ 
rithm. Let r : TZ —> S be a bijective transformation that is a ver¬ 
tical composition/ decomposition between schemas TZ = (R, E n) 
and S = (S,E,s). Let R £ R be a relation in TZ such that 
t(R) = Si, • • ■ , S m , 1 < rn < |S|. Let / and J be instances of 
TZ and S, respectively, such that r(7) = J. Because of Corollary 
4.3.2 in (36) , we know that if t is bijective, Es contains inclusion 
dependencies between the join attributes of Si, ■ ■ ■ , S m - 

We fist show that we get equivalent results for transformation r, 
which is a vertical decomposition. Assume the input to the ARMG 
algorithm under schema TZ is the bottom clause X e ,i and positive 
example e!. The bottom clause _L e ,j is the ordered clause Cn- 
Assume it has the form 

T(w) <- Li(ui), ■■■ , Li-i(ui-i), 

R(ui), 

Li-\-l ('Wi+l)) * * * j Ln{Un) ■ 

where w is a variable tuple, Lj is a literal with predicate symbol 
in TZ, and Uj is a free tuple, 1 < j < n. Because r is a bijective 
Horn transformation, according to Proposition |3.7| r is definition 
preserving w.r.t. TLT>n and TLVs- This means that clause S T {Cn) 
exists that is equivalent to Cn- Then, Cs = chases s {S T {Cn)) is 
given by 

T(w) 4— Li(ui), • ■ ■ , L i i_ 1 (u i i_i), 

Sl(vi'),-‘‘ ,Sm{Vm'), 

L^-(-i(tr* * * ,L n /{u n /{. 

where v-j is a free tuple, 1 < j < m, and 
t(Li(ui), • ■ • , Li+i(Wi+i), • • • , L n (u n )) = 

L'i(u'i), ■ ■ ■ , L\,_ 1 (k',_ 1 ), L '/ +1 (w'/ +1 ), • • • ,L ' n ,«,). That is, 
all literals in the body of Cn are replaced by the equivalent literals 

in schema S. _ 

Assume that $R(u_i)$ is a blocking atom w.r.t. $e'$ in $C_{\mathcal{R}}$. Then for all $\theta$ such that $e' = T(w)\theta$, we have that

$$I \not\models (L_1(u_1), \cdots, L_{i-1}(u_{i-1}), R(u_i))\theta$$

By definition, we know that $R(u_i)$ is the first literal with relation name $R$ that is a blocking atom. Then, no literal in $L_1(u_1), \cdots, L_{i-1}(u_{i-1})$ has relation name $R$ and is a blocking atom. According to the ARMG algorithm, $R(u_i)$ must be removed from $C_{\mathcal{R}}$.

Then, the resulting clause, which we denote $C'_{\mathcal{R}}$, has the following form

$$T(w) \leftarrow L_1(u_1), \cdots, L_{i-1}(u_{i-1}),\; L_{i+1}(u_{i+1}), \cdots, L_n(u_n).$$
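The notion of a blocking atom used above can be made concrete with a small sketch. The Python fragment below is a minimal illustration, not the ARMG implementation discussed in the paper: it assumes the database is a set of ground facts, treats capitalized strings as variables, and uses a naive backtracking matcher in place of a real Datalog engine. The names `match`, `provable`, `first_blocking_atom`, and the toy `advisor`/`member` relations are hypothetical.

```python
def is_var(term):
    # Assumption: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def match(atom, fact, theta):
    """Extend substitution theta so that atom under theta equals fact, else None."""
    pred, args = atom
    fact_pred, fact_args = fact
    if pred != fact_pred or len(args) != len(fact_args):
        return None
    theta = dict(theta)
    for term, const in zip(args, fact_args):
        if is_var(term):
            if theta.setdefault(term, const) != const:
                return None
        elif term != const:
            return None
    return theta


def provable(body, facts, theta):
    """True if some extension of theta maps every atom in body to a fact."""
    if not body:
        return True
    for fact in facts:
        extended = match(body[0], fact, theta)
        if extended is not None and provable(body[1:], facts, extended):
            return True
    return False


def first_blocking_atom(head, body, example, facts):
    """Index of the first body atom whose prefix is not provable, or None."""
    theta = match(head, example, {})
    if theta is None:
        raise ValueError("example does not match the clause head")
    for end in range(1, len(body) + 1):
        if not provable(body[:end], facts, theta):
            return end - 1
    return None


# Toy instance and ordered clause T(X) <- advisor(X, Y), member(X, D), member(Y, D).
facts = {
    ("advisor", ("ann", "bob")), ("advisor", ("carl", "dan")),
    ("member", ("ann", "cs")), ("member", ("bob", "cs")),
    ("member", ("carl", "math")), ("member", ("dan", "physics")),
}
head = ("T", ("X",))
body = [("advisor", ("X", "Y")), ("member", ("X", "D")), ("member", ("Y", "D"))]
```

With this toy data, the third body atom blocks the example `T(carl)`, because `carl` and his advisor are not members of the same department, while no atom blocks `T(ann)`.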


Now consider clause $C_{\mathcal{S}}$. If no literals of $S_1(v_1), \cdots, S_m(v_m)$ are blocking atoms w.r.t. $e'$ in $C_{\mathcal{S}}$, then we would have that

$$J \models (L'_1(u'_1), \cdots, L'_{i'-1}(u'_{i'-1}), S_1(v_1), \cdots, S_m(v_m))\theta$$

However, this is equivalent to

$$I \models (L_1(u_1), \cdots, L_{i-1}(u_{i-1}), R(u_i))\theta$$

which we know is not true. Therefore at least one literal of $S_1(v_1), \cdots, S_m(v_m)$ must be a blocking atom.

Assume the blocking atom w.r.t. $e'$ in $C_{\mathcal{S}}$ is $S_k(v_k)$. Then, we have that

$$J \not\models (L'_1(u'_1), \cdots, S_1(v_1), \cdots, S_k(v_k))\theta$$

Let clause $C'_{\mathcal{S}}$ be the result of removing blocking atoms w.r.t. $e'$ or atoms that are not head-connected in $S_1(v_1), \cdots, S_m(v_m)$ from $C_{\mathcal{S}}$. Because $\tau$ is a bijective transformation, at least one literal of $S_1(v_1), \cdots, S_m(v_m)$ must share join attributes with $S_k(v_k)$. Let $S_j(v_j)$, $1 \le j \le m$, $j \ne k$, be this literal. Then, inclusion dependency $S_k[A_k] = S_j[A_j]$ must be in $\Sigma_{\mathcal{S}}$, where $A_k$ and $A_j$ are join attributes.

Because $S_k(v_k)$ is a blocking atom w.r.t. $e'$, the ARMG algorithm removes it from $C_{\mathcal{S}}$. After $S_k(v_k)$ is removed, there are three possible cases: i) $S_j(v_j)$ is not head-connected, ii) $S_j(v_j)$ is head-connected but not a blocking atom, or iii) $S_j(v_j)$ is a head-connected blocking atom.

If $S_j(v_j)$ is not head-connected, then, according to the ARMG algorithm, it would be removed from $C_{\mathcal{S}}$. Therefore, $S_j(v_j)$ would not be in $C'_{\mathcal{S}}$. Now we will show that if $S_j(v_j)$ is head-connected, then it must be a blocking atom w.r.t. $e'$. Assume $S_j(v_j)$ is not a blocking atom w.r.t. $e'$, and $S_k(v_k)$ has already been removed from $C_{\mathcal{S}}$. Then, we would have that

$$J \models (L'_1(u'_1), \cdots, S_j(v_j))\theta$$

Because of inclusion dependency $S_k[A_k] = S_j[A_j]$, there must be a ground atom $S_k(t)$ in $J$, where $t$ is a tuple that shares join attributes with $S_j(v_j)\theta$. If this is the case, then $C_{\mathcal{S}}$ should contain a literal $S_k(v'_k)$ such that $S_k(v'_k)\theta = S_k(t)$, as $C_{\mathcal{S}}$ is a bottom clause. Also, $(S_j(v_j), S_k(v'_k))$ must be provable from $J$, as the bottom clause was generated relative to $J$ using inverse entailment, which is a complete logical system when the language is restricted to Datalog programs. Then, the following must hold

$$J \models (L'_1(u'_1), \cdots, L'_{i'-1}(u'_{i'-1}), S_j(v_j), S_k(v'_k))\theta$$

Therefore, $S_k(v'_k)$ is not a blocking atom. However, we know that $S_k(v'_k)$ is different from $S_k(v_k)$ because $S_k(v_k)$ is a blocking atom. Let the result of joining $S_k(v'_k)$ and all literals in $S_1(v_1), \cdots, S_m(v_m)$ except $S_k(v_k)$ be literal $R(u'_i)$. This literal is different from the original literal $R(u_i)$, which was the blocking atom that we removed from $C_{\mathcal{R}}$. $R(u'_i)$ is also in $C_{\mathcal{R}}$ and appears after $R(u_i)$, as $R(u_i)$ is the first encountered blocking atom. Because $R(u_i)$ and $R(u'_i)$ are different, $S_j(v_j)$ must be a blocking atom w.r.t. $e'$ in $C_{\mathcal{S}}$ and must be removed by the ARMG algorithm.

Removing $S_j(v_j)$ from $C_{\mathcal{S}}$ would cause another literal in $S_1(v_1), \cdots, S_m(v_m)$ to be removed, and so on. Then, all literals $S_1(v_1), \cdots, S_m(v_m)$ would be removed from $C_{\mathcal{S}}$, resulting in clause $C'_{\mathcal{S}}$ given by

$$T(w) \leftarrow L'_1(u'_1), \cdots, L'_{i'-1}(u'_{i'-1}),\; L'_{i'+1}(u'_{i'+1}), \cdots, L'_{n'}(u'_{n'}).$$

Therefore, $C'_{\mathcal{S}}$ is equivalent to $C'_{\mathcal{R}}$.

We now show that we also get equivalent results for transformation $\tau^{-1}$, which is a vertical composition. Assume the input to the ARMG algorithm under schema $\mathcal{S}$ is the bottom clause $\bot_{e,J}$ and positive example $e'$. The bottom clause $\bot_{e,J}$ is the ordered clause $C_{\mathcal{S}}$, which was shown above. Then, $C_{\mathcal{R}} = chase_{\Sigma_{\mathcal{R}}}(\delta_{\tau^{-1}}(C_{\mathcal{S}}))$ is given by

$$T(w) \leftarrow L_1(u_1), \cdots, L_{i-1}(u_{i-1}),\; R(u_i),\; L_{i+1}(u_{i+1}), \cdots, L_n(u_n).$$

Assume that $S_k(v_k)$, $1 \le k \le m$, is a blocking atom w.r.t. $e'$ in $C_{\mathcal{S}}$. We have shown that if this is the case, all literals $S_1(v_1), \cdots, S_m(v_m)$ would be removed from $C_{\mathcal{S}}$, resulting in clause $C'_{\mathcal{S}}$. Because we have that $\tau^{-1}(S_1(v_1), \cdots, S_m(v_m)) = R(u_i)$, then $R(u_i)$ must also be a blocking atom w.r.t. $e'$ in $C_{\mathcal{R}}$. Then, it would be removed from $C_{\mathcal{R}}$, resulting in clause $C'_{\mathcal{R}}$, which is equivalent to $C'_{\mathcal{S}}$. □
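The case analysis in this proof relies on ARMG discarding atoms that are no longer head-connected once a blocking atom has been removed. A minimal sketch of that pruning step follows. It assumes the ordered-clause reading of head-connectedness (a body atom is kept if it shares a variable with the head or with an earlier kept atom) and ignores mode declarations; `prune_not_head_connected` and the relation names are hypothetical.

```python
def is_var(term):
    # Assumption: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def variables(atom):
    """Set of variables occurring in an atom given as (predicate, args)."""
    return {term for term in atom[1] if is_var(term)}


def prune_not_head_connected(head, body):
    """Keep body atoms sharing a variable with the head or an earlier kept atom."""
    seen = variables(head)
    kept = []
    for atom in body:
        atom_vars = variables(atom)
        if atom_vars & seen:
            kept.append(atom)
            seen |= atom_vars
    return kept


# Clause body after a blocking atom s1(X, K) has been removed: s2 and s3 were
# linked to the head only through K, so they are no longer head-connected.
head = ("T", ("X",))
body_after_removal = [
    ("l1", ("X", "Y")),
    ("s2", ("K", "D")),
    ("s3", ("K", "E")),
    ("l2", ("Y", "Z")),
]
pruned = prune_not_head_connected(head, body_after_removal)
```

In this toy clause, `s2` and `s3` correspond to case i) above: once the atom that introduced `K` is gone, both are dropped, while `l1` and `l2` remain.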

Proof of Theorem 7.1

PROOF. Let $\mathcal{R}$ and $\mathcal{S}$ be two definition equivalent schemas. Schema $\mathcal{R} = (\mathbf{R}, \Sigma)$ contains the single relation $R(A_1, \cdots, A_l)$. Assume that $l > 2$ and there are $l - 1$ functional dependencies $A_1 \to A_i$, $2 \le i \le l$, in $\Sigma$. Let $\mathcal{S} = (\mathbf{S}, \Omega)$ be a vertical decomposition of schema $\mathcal{R}$, such that relation $R(A_1, \cdots, A_l) \in \mathbf{R}$ is decomposed into $l - 1$ relations in $\mathbf{S}$ in the form of $S_i(A_1, A_i)$, $2 \le i \le l$. For each relation $S_i(A_1, A_i) \in \mathbf{S}$, $\Omega$ contains the functional dependency $A_1 \to A_i$. For each set of relations $S_i(A_1, A_i)$, $2 \le i \le l$, $\Omega$ also contains $2(l - 1)$ inclusion dependencies in the form of $S_2.A_1 \subseteq S_j.A_1$ and $S_j.A_1 \subseteq S_2.A_1$, $2 \le j \le l$.
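To make this decomposition concrete, the sketch below splits an instance of $R(A_1, \cdots, A_l)$ into the binary relations $S_i(A_1, A_i)$ and rebuilds it by joining on $A_1$. It is only an illustration under the constraints stated above (the functional dependencies $A_1 \to A_i$ and the inclusion dependencies on $A_1$); the function names and the sample tuples are made up for this example.

```python
def decompose(r_tuples):
    """Project a non-empty instance of R(A1..Al) onto S_i(A1, Ai), i = 2..l."""
    arity = len(next(iter(r_tuples)))
    return {i: {(t[0], t[i - 1]) for t in r_tuples} for i in range(2, arity + 1)}


def compose(s_relations):
    """Rebuild R by joining every S_i on A1.

    Relies on the FDs A1 -> Ai (one Ai value per A1 value) and on the
    inclusion dependencies (every S_i holds the same set of A1 values).
    """
    indices = sorted(s_relations)
    lookups = {i: dict(s_relations[i]) for i in indices}
    keys = set(lookups[indices[0]])
    return {(key,) + tuple(lookups[i][key] for i in indices) for key in keys}


# Hypothetical instance of R(A1, A2, A3) with A1 as the key.
r_instance = {("e1", "ann", "cs"), ("e2", "bob", "math")}
s_instance = decompose(r_instance)
```

Here a relation of arity 3 yields the two binary relations `s_instance[2]` and `s_instance[3]`, and `compose(s_instance)` returns the original instance, which is the lossless round trip that a bijective vertical (de)composition requires.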

Let $p_{\mathcal{R}}$ be the number of relations in schema $\mathcal{R}$, $a_{\mathcal{R}}$ be the largest arity of any relation in $\mathcal{R}$, $k_{\mathcal{R}}$ be the largest number of variables in a clause, and $m_{\mathcal{R}}$ be the number of clauses in the definition of the target relation over $\mathcal{R}$. We define $p_{\mathcal{S}}$, $a_{\mathcal{S}}$, $k_{\mathcal{S}}$, and $m_{\mathcal{S}}$ analogously. The largest number of constants (i.e., objects) in any example is denoted by $n$. Parameter $n$ is a constraint on the answers of the oracle; therefore, it is independent of the hypothesis space and the schemas. Because the number of relations in $\mathcal{R}$ is $p_{\mathcal{R}} = 1$ and the maximum arity is $a = a_{\mathcal{R}}$, the maximum number of relations in $\mathcal{S}$ is $p_{\mathcal{S}} = a - 1$. We also have that $a_{\mathcal{S}} = 2$.

Let $\mathcal{L}$ be the hypothesis language that consists of the subset of Horn definitions that contain a single clause in which no self-joins are allowed. All definitions in $\mathcal{L}$ under schema $\mathcal{R}$ have the form

$$T(u) \leftarrow R(x_1, x_2, \cdots, x_l).$$

where $T$ is the target relation and $u$ is a subset of $\{x_1, x_2, \cdots, x_l\}$.

Any clause in a definition $H_{\mathcal{R}} \in \mathcal{L}$ under schema $\mathcal{R}$ has at most $l$ distinct variables, which corresponds to the arity of relation $R$. Therefore $k_{\mathcal{R}} = l$. As schema $\mathcal{S}$ is a vertical decomposition of schema $\mathcal{R}$, and no self-joins are allowed in $\mathcal{L}$, the definition $\delta(H_{\mathcal{R}}) = H_{\mathcal{S}} \in \mathcal{L}$ also has at most $l$ variables. We will use $k = k_{\mathcal{R}}$ to denote the upper bound on $k_{\mathcal{R}}$ and $k_{\mathcal{S}}$. Because definitions in $\mathcal{L}$ consist of a single clause, the maximum number of clauses in a definition is $m = 1$. In general, $m_{\mathcal{R}} = m_{\mathcal{S}}$ because $\mathcal{S}$ is a vertical decomposition of $\mathcal{R}$.

The upper bound on the number of EQs and MQs in the A2 algorithm is $O(m^2 p k^{a+3k} + n m p k^{a+k})$, and the lower bound is $\Omega(m p k^a)$ [25]. In order to prove our theorem, the following should hold for $\mathcal{R}$ and $\mathcal{S}$

$$\Omega(m p (k_{\mathcal{R}})^a) > O(m^2 p (a-1)(k_{\mathcal{S}})^{2+3k_{\mathcal{S}}} + n m p (a-1)(k_{\mathcal{S}})^{2+k_{\mathcal{S}}})$$

where the left side of the inequality is the lower bound on the query complexity under schema $\mathcal{R}$ and the right side is the upper bound on the query complexity under schema $\mathcal{S}$. The operator $>$ means that A2 will always ask asymptotically more queries under schema $\mathcal{R}$ than under schema $\mathcal{S}$. We have that $k_{\mathcal{R}}$ and $k_{\mathcal{S}}$ are bounded by $k$ and $m$ is the same for both schemas. We can also ignore $n$ as it is independent of the hypothesis space and the schemas. Therefore, by canceling out some terms, the previous inequality can be rewritten as $\Omega(k^a) > O(m(a-1)k^{2+3k} + (a-1)k^{2+k})$.

The first term in the upper bound dominates the second term, so we have $\Omega(k^a) > O(m(a-1)k^{2+3k})$. Assuming that $m = 1$, as in $\mathcal{L}$, we get $\Omega(k^a) > O((a-1)k^{2+3k})$. This inequality holds for sufficiently large $k$ and $a$. □
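As a rough sanity check on the last step, one can drop the hidden constants and compare $k^a$ with $(a-1)k^{2+3k}$ directly; for $k > 1$ this comparison holds exactly when $a > 2 + 3k + \log_k(a-1)$. The Python sketch below is an illustration only: the asymptotic bounds say nothing about concrete constants, and the helper names are hypothetical.

```python
def lower_exceeds_upper(k, a):
    """Compare k^a with (a - 1) * k^(2 + 3k), ignoring hidden constants."""
    return k ** a > (a - 1) * k ** (2 + 3 * k)


def smallest_arity(k):
    """Smallest arity a >= 2 for which the comparison holds, for a given k >= 2."""
    a = 2
    while not lower_exceeds_upper(k, a):
        a += 1
    return a
```

For instance, with $k = 2$ the comparison first holds at $a = 12$, since $2^{12} = 4096$ exceeds $11 \cdot 2^8 = 2816$, while $2^{11} = 2048$ does not exceed $10 \cdot 2^8 = 2560$.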

B. SUPPLEMENT EMPIRICAL SETTING AND 
RESULTS 

Aleph and GILPS Configuration: In Aleph, we use the following configuration: noise = 100%, minpos = 10, search = heuristic, evalfn = compression, nodes = 10000, clauselength = 5, and i = 2. We use the default values for the rest of the parameters except for openlist (beam), which we set to 1 to emulate FOIL and inf to emulate Progol. In ProGolem, we use the default configuration, except for the following parameters: noise = 100% and i = 1.



